DZone Spotlight

Sunday, October 19
Porting From Perl to Go: Simplifying for Platform Engineering

By Mark Gardner
The Problem With the brew upgrade Command

By default, the brew upgrade command updates every formula (terminal utility or library). It also updates every cask (GUI application) it manages. All are upgraded to the latest version — major, minor, and patch. That's convenient when you want the newest features, but disruptive when you only want quiet patch-level fixes.

Last week, I solved this in Perl with brew-patch-upgrade.pl, a script that parsed brew upgrade's JSON output, compared semantic versions, and upgraded only when the patch number changed. It worked, but it also reminded me how much Perl leans on implicit structures and runtime flexibility.

This week I ported the script to Go, the lingua franca of DevOps. The goal wasn't feature parity — it was to see how Go's design choices map onto platform engineering concerns.

Why Port to Go?

Portfolio practice: I'm building a body of work that demonstrates platform engineering skills.
Operational focus: Go is widely used for tooling in infrastructure and cloud environments.
Learning by contrast: Rewriting a working Perl script in Go forces me to confront differences in error handling, type safety, and distribution.

The Journey

Error Handling Philosophy

Perl gave me try/catch (experimental in the Perl v5.34.1 that ships with macOS, but since accepted into the language in v5.40). Go, famously, does not. Instead, every function returns an error explicitly.

Perl

use v5.34;
use warnings;
use experimental qw(try);
use Carp;
use autodie;
...
try {
    system 'brew', 'upgrade', $name;
    $result = 'upgraded';
} catch ($e) {
    $result = 'failed';
    carp $e;
}

Go

package main

import (
    "log"
    "os/exec"
)
...
cmd := exec.Command("brew", "upgrade", name)
if output, err := cmd.CombinedOutput(); err != nil {
    log.Printf("failed to upgrade %s: %v\n%s", name, err, output)
}

The Go version is noisier, but it forces explicit decisions. That's a feature in production tooling: no silent failures.

Dependency Management

Perl: cpanfile + CPAN modules. Distribution means "install Perl (if it's not already), install modules, run script." Tools like carton and the cpan or cpanm commands help automate this. Additionally, one can use further tooling like fatpack and pp to build more self-contained packages, but those are neither common nor (except for cpan) distributed with Perl.
Go: go.mod + go build. Distribution is a single (platform-specific) binary. For operational tools, that's a massive simplification. No runtime interpreter, no dependency dance.

Type Safety

Perl let me parse JSON into hashrefs and trust the keys exist. Go required a struct:

Go

type Formula struct {
    Name              string   `json:"name"`
    CurrentVersion    string   `json:"current_version"`
    InstalledVersions []string `json:"installed_versions"`
}

The compiler enforces assumptions that Perl left implicit. That friction is valuable — it surfaces errors early.

Binary Distribution

This is where Go shines. Instead of telling colleagues "install Perl v5.34 and CPAN modules," I can hand them a binary. No need to worry about scripting runtime environments — just grab the right file for your system.

homebrew-semver-guard-darwin (Universal Binary for macOS)
homebrew-semver-guard-linux-amd64 (Intel/AMD 64-bit binary for Linux)
homebrew-semver-guard-linux-arm64 (Arm 64-bit binary for Linux)

These are available on the release page. Download, run, done.

Semantic Versioning Logic

In Perl, I manually compared arrays of version numbers. In Go, I imported golang.org/x/mod/semver:

Go

import (
    "golang.org/x/mod/semver"
)
...
if semver.MajorMinor(toSemver(formula.InstalledVersions[0])) != semver.MajorMinor(toSemver(formula.CurrentVersion)) {
    log.Printf("%s is not a patch upgrade", formula.Name)
    results.skipped++
    continue
}

Cleaner, more legible, and less error-prone. The library encodes the convention, so I don't have to.

Deliberate Simplification

I didn't port every feature. Logging adapters, signal handlers, and edge-case diagnostics remained in Perl. The Go version focuses on the core logic: parse JSON, compare versions, and run upgrades. That restraint was intentional — I wanted to learn Go's idioms, not replicate every Perl flourish.

Platform Engineering Insights

Three lessons stood out:

Binary distribution matters. Operational tools should be installable with a single copy step. Go makes that trivial.
Semantic versioning is an operational practice. It's not just a convention for library authors — it's a contract that tooling can enforce.
Go's design aligns with platform needs. Explicit errors, type safety, and static binaries all reduce surprises in production.

Bringing It Home

This isn't a "Perl vs. Go" story. It's a story about deliberate simplification: taking a working Perl script and recasting it in Go to see how the language's choices shape a solution to the same problem.

The result is homebrew-semver-guard v0.1.0, a small but sturdy tool. It's not feature-finished, but it's production-ready in the ways that matter. Next up: I'm considering more Go tools, maybe even Kubernetes for services on my home server. This port was practice, an artifact demonstrating platform engineering in action.

Links

Original Perl script: brew-patch-upgrade.pl
Go release: homebrew-semver-guard v0.1.0
Last week's post: Patch-Perfect: Smarter Homebrew Upgrades on macOS
The Ethics of AI Exploits: Are We Creating Our Own Cyber Doomsday?

By Omkar Bhalekar
As artificial intelligence advances at a pace never previously encountered, its impact on society is becoming ever more profound and extensive. From autonomous vehicles and personalized medicine to generative media and intelligent infrastructure, AI is changing every area it touches. But lurking behind these revolutionary promises is a chilling, dark fear: are we also building the tools of our own digital demise? The ethics of AI exploits, whether intentional or emergent, raise profoundly disturbing questions about the future of cybersecurity, anonymity, and even global security.

A Double-Edged Sword

AI is widely hailed as a force for good. Its ability to detect threats, analyze great volumes of data, and make complex decisions on its own has established it as a cornerstone of modern innovation. But with every great technology comes the risk of dual use. The same machine learning that can detect cancer can be repurposed as a weapon to find zero-day vulnerabilities in computer systems. The same algorithms that make traffic predictions can be used to bypass intrusion prevention systems. The large language models that power conversational AI can be exploited to generate phishing emails indistinguishable from legitimate communication, or worse, to socially engineer their way through enterprise defenses. We're no longer speculating. These capabilities are already here.

AI as a Weapon

The idea of cyberwarfare isn't new. Nation-states and criminal organizations have been waging silent wars in cyberspace for decades. But AI drastically shifts the scale and speed of these operations.

Take offensive AI, for example. Machine learning tools can now scan the internet for vulnerabilities on their own, rank them by impact, and even personalize attacks against specific targets using information pilfered from social media, leaked data, or past breaches. What once took months of human effort is now possible in hours or even minutes.

More disturbing is the emergence of autonomous cyber weapons: AI-driven systems with the ability to make choices, learn from experience, and launch cyberattacks with minimal or no human involvement. These kinds of systems raise an existential ethical question: where does delegation end and abdication begin? If an AI system launches a cyberattack on critical infrastructure autonomously, who should be held responsible? The developer? The deployer? The AI?

The Rise of AI Exploits

The term "AI exploits" covers both the exploitation of AI systems and AI as an exploitative agent. On one side, hackers are finding ways to mislead or circumvent AI systems through adversarial examples, data poisoning, model inversion, and prompt injection attacks. Through such methods, AI models can be fooled into making dangerous mistakes or revealing sensitive information.

On the other side, AI is being used to exploit vulnerabilities in traditional systems with ghastly efficiency. Security researchers have demonstrated that generative models can be trained to produce polymorphic malware that changes its signature to evade detection. Others have trained models to detect misconfigurations in cloud deployments or crack passwords more effectively than brute-force tools. A dark ecosystem is emerging around AI-based hacking tools, including some that are open-sourced or available through underground forums. The democratization of AI-driven exploits means that even unsophisticated attackers now have access to advanced tools, expanding the threat surface exponentially.

Ethics in the Age of Digital Leviathans

The ethical dilemma is not merely one of avoiding abuse. It is one of redefining what responsibility looks like in a world where code can think, learn, and act.

For one, the pace of development is far outstripping the establishment of safeguards. AI developers, often racing for market share or academic prestige, may overlook or under-prioritize security. We've already seen examples where popular AI APIs were exploited to produce hate speech, violate privacy, or bypass content moderation.

Also, there is no universal code of conduct for how AI systems may operate or be used in offensive cyber environments. A system one nation considers defensive AI can be seen by another as a first-strike weapon. No international agreements or norms on weaponizing AI exist, an echo of the early days of nuclear proliferation. Only now, the barrier to entry is far lower.

Are We Sleepwalking into a Cyber Doomsday?

To dismiss this as nothing but science fiction would be a disservice to the reality unfolding before our eyes. AI can already detect and exploit security vulnerabilities on its own. Deepfakes have undermined the very foundation of trust in visual and audio evidence. Social media bots can influence public perception and upend elections. Code written by AI can be rife with backdoors, or simply written in a way that exploits esoteric logic bugs in compilers and runtimes.

Now imagine those capabilities combined into a self-replicating cyberorganism: an AI-driven worm that learns, adapts, and spreads across networks, tailoring its payload to each target. It's not impossible. Indeed, researchers have already built proofs of concept around this very threat model.

A "cyber doomsday" is not necessarily a single, monolithic cataclysm. It might arrive more insidiously: the gradual erosion of online trust, large-scale disruption of services, and desensitization to AI-enabled sabotage. Financial markets manipulated by adversaries. Critical infrastructure, such as water and electrical grids, taken down by autonomous exploits. Corporate and state secrets siphoned off by hyper-personalized social engineering. No Skynet required. Just apathy.

Responsibility and Foresight

We do have choices. Ethics should never be an afterthought; it needs to be embedded in the very fabric of AI design. This means:

Secure by design: AI models and platforms must be constructed with security as a core principle, not an additional feature.
Red teaming and adversarial testing: AI systems must undergo rigorous red teaming to understand how they might be manipulated or exploited.
Transparency and explainability: There are too many black boxes among AI systems. We must make explainable AI a priority so we understand how conclusions are reached and how they can go wrong.
Accountability mechanisms: Governments and institutions must design regulatory mechanisms that hold creators and operators of AI accountable for its misuse, whether intentional or emergent.
Global cooperation: As with nuclear arms control and chemical weapons conventions, there must be global cooperation to define norms and red lines for AI use in cyberspace.

What's Next?

AI is neither ethical nor unethical. It's a reflection of our own intentions, blind spots, and decisions. As we push the boundaries of what machines can do, we must also extend our capacity to anticipate the consequences. The ethics of AI exploitation is not a purely technical debate; it is a social imperative. In the absence of visionary governance, the technologies we develop to empower humanity will become the harbingers of its digital collapse. The issue now is not whether AI can be employed for evil; it already is. The issue is whether we will act with enough foresight and integrity to steer it away from the edge of the abyss.

Trend Report

Kubernetes in the Enterprise

Over a decade in, Kubernetes is the central force in modern application delivery. However, as its adoption has matured, so have its challenges: sprawling toolchains, complex cluster architectures, escalating costs, and the balancing act between developer agility and operational control. Beyond running Kubernetes at scale, organizations must also tackle the cultural and strategic shifts needed to make it work for their teams.

As the industry pushes toward more intelligent and integrated operations, platform engineering and internal developer platforms are helping teams address issues like Kubernetes tool sprawl, while AI continues cementing its usefulness for optimizing cluster management, observability, and release pipelines.

DZone's 2025 Kubernetes in the Enterprise Trend Report examines the realities of building and running Kubernetes in production today. Our research and expert-written articles explore how teams are streamlining workflows, modernizing legacy systems, and using Kubernetes as the foundation for the next wave of intelligent, scalable applications. Whether you're on your first prod cluster or refining a globally distributed platform, this report delivers the data, perspectives, and practical takeaways you need to meet Kubernetes' demands head-on.

Refcard #387

Getting Started With CI/CD Pipeline Security

By Sudip Sengupta
Refcard #216

Java Caching Essentials

By Granville Barnett
More Articles

Centralized Job Execution Strategy in Cloud Data Warehouses

Control Diagram

The Architecture: Core Components

Query vault table
Controller procedures
Trigger point (can be an external or internal trigger)

Details of Core Components

1. Query Vault Table

This table serves as the heart of this strategy. We can securely store all the queries that load data into the cloud data warehouse in this Vault table. This table contains the following fields:

Job_Group: This is a logical segregation of queries between jobs. For example, if the Employee table gets its data from different sources and has multiple queries, then all these queries come under one 'Job Group.' Similarly, for the other tables, if there are multiple queries, they come under a separate 'Job Group.' We distinguish one set of queries from the others to help the controlling section identify the jobs effectively.
Destination: This field represents the final table into which we load the data. For example, there could be multiple staging tables before we load into the final target table (the Employee table example). We need to mention the final target table (Employee in this case), which provides visibility on the final table (be it a fact or dimension) and helps the controlling section segregate things one level further down in case we intend to load more than one table.
Job_Group_Order_Number: This field helps with constraint-based loading within a particular job group. If we have multiple target tables under the same job group and they need to be loaded in a certain order, then while making entries in the Query Vault table, we need to insert the entries in the order in which we want them to be executed. Sorting on this field executes the jobs in the correct and desired order during orchestration.
Action_Tag: This is a descriptive field, predominantly for a summarized understanding of the entry in the Vault table. It doesn't hold any functional significance, but it enhances the understanding of the users. It can be removed if needed.
Query_Mode: This field specifies the action executed by the query. For example, if we are inserting data into a table, then this field contains 'Insert.' Certain logic in the controlling sections is based on the Action Tag field.
Query_Text: This field contains the actual query that needs to be executed. Please refer to the 'Rules for Query Text field' sub-section under the 'Important Notes for Usage' section for important details.
Run_Order: When we have multiple queries that load into one final table, there must be some order of query execution that we should follow to load the data correctly. This field provides the controlling section with the order in which the queries need to be executed. Start with 1 and increment by 1. Please refer to the 'Rules for Run Order field' sub-section under the 'Important Notes for Usage' section for important details.
Log_Data_Lifespan_in_Days: The log table holds the statistics for both current and historical executions. If we enter 10 in this field, we are directing the log table to hold only 10 days of execution history. This field helps with backtracking of executions and failures, which becomes extremely important when we try to build an operational dashboard to monitor the health metrics of jobs.
Run_Plan: This field can contain only two values. One is 'Normal,' which indicates the 'Job Group' will execute on a normal schedule, like once a day, week, or month; the other is 'Out of Sequence,' which indicates that the job needs special runs. Its main purpose is to serve as a descriptive field that informs the user whether a job runs on a normal schedule or OOS.
Execution_Switch: We often encounter situations where we want to skip execution of a query (or set of queries) due to restarts, incremental loads, etc. This switch gives the user control over which queries are executed in not-so-regular situations. If this field says 'Yes,' the entry in the Vault table is considered for execution. If this field says 'No,' the execution of the query that belongs to this record is skipped.
Job_Status: The default value is 'Not Started.' Once the job group is submitted for execution, the job status changes to 'In Progress,' and if the query is successfully executed, it changes to 'Success.' If the job fails for any reason, this field serves as a marker for the controlling section to restart from the failed point/query instead of restarting from the first query.

2. Controller Procedures

Workflow Controller

The workflow controller is a stored procedure written in JavaScript (it can be written using PL/SQL too) to control the execution, especially in situations where we are targeting more than one target table or want to coordinate the dependencies between two data models.

Execution statement: call schema_name.workflow_controller_stored_proc_name(Job_Group, Schedule)

'Job Group' and 'Schedule' are the parameters that a user needs to pass manually, based on which either the query selection for execution is done, or parameters are created for an iterative stored proc to execute multiple job groups.

Task Controller

The task controller is a stored procedure intended to control the execution of queries within a job group and destination. This is helpful especially when we have dependencies between different modules or models and the loads must be performed in a specific sequence. All such modules or models must be clubbed under one job group. We do not have to explicitly call the task controller, as the call happens automatically, internally, and iteratively until all the combinations of Job Group and Destination are exhausted.

Execution statement: call schema_name.task_controller_stored_proc_name(Job_Group, Destination)

3. Trigger Controller

The trigger controller refers to a tool from which we invoke the workflow controller. This can be an ETL tool (Informatica PowerCenter, IDMC), a feature like Snowflake tasks in the Snowflake database, or any other tool that can invoke a stored procedure in a database. Let's consider both Informatica Intelligent Data Management Cloud (IDMC) and Snowflake for explanation purposes.

Informatica IDMC

Create a simple mapping with a dummy source and target.
Create a mapping task on top of the mapping.
Define a parameter (say $$Workflow_Controller) and use that parameter in the pre-SQL section.
Mention the parameter file name and location in the mapping task.
Prepare the parameter file by assigning a procedure call statement to a pre-SQL parameter.
The purpose of creating parameters is to reuse a single mapping for multiple jobs.

Snowflake Tasks

Create a task with the desired name. The definition of this task contains the call statement to the stored proc.
Enable the task, as it's suspended by default.
Ensure the role with which you are executing the task has permissions for the warehouse, task, and stored proc.
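The controllers themselves are stored procedures called from the trigger controller, but the selection logic they implement can be sketched in a few lines. The following Python sketch is purely illustrative and is not the author's code: it assumes a generic DB-API connection and a table named query_vault with the fields described above.

Python

# Illustrative sketch only: the article's controllers are stored procedures
# (JavaScript or PL/SQL) invoked from IDMC or Snowflake tasks. This hypothetical
# Python version shows the same selection logic against a query_vault table,
# using a generic DB-API connection with %s-style parameters.

def run_job_group(conn, job_group):
    cur = conn.cursor()
    # Workflow controller: find each destination for the job group, in order.
    cur.execute(
        """SELECT DISTINCT destination, job_group_order_number
             FROM query_vault
            WHERE job_group = %s
            ORDER BY job_group_order_number""",
        (job_group,),
    )
    for destination, _ in cur.fetchall():
        run_destination(conn, job_group, destination)

def run_destination(conn, job_group, destination):
    cur = conn.cursor()
    # Task controller: pick only switched-on queries, in run order.
    cur.execute(
        """SELECT query_text
             FROM query_vault
            WHERE job_group = %s
              AND destination = %s
              AND execution_switch = 'Yes'
            ORDER BY run_order""",
        (job_group, destination),
    )
    for (query_text,) in cur.fetchall():
        cur.execute(query_text)  # the real controllers also record status and errors in the log table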
How This Orchestration Works

A call is made to the workflow controller from the trigger controller, based on the schedule of the job or a manual trigger.
This call contains two important parameters: Job Group and Schedule. This step creates one or more call statements, based on the number of values in the 'Destination' field for a job group.
The call statements generated in step 2 are passed to the task controller in the order they are generated. 'Job_Group_Order_Number' helps generate the call statements in the desired order.
For each call statement passed to the task controller, query selection for execution is done based on the parameters provided and the 'Execution_Switch' field. The 'Run_Order' field helps execute the queries in the right order inside a 'Job Group' and Destination.
The workflow controller ensures that the task controller iterates through all the queries under a 'Job Group' and Destination. It also ensures that the task controller loops through all the call statements generated.
During the execution of queries for a 'Job Group' and Destination, the log information is stored in a temporary table, which is then pushed to the log table.

Log Table

The audit logging feature is completely automated and does not require any manual intervention. The log table comes in handy to identify the reason for query failures, the number of records inserted/updated/deleted, the amount of time a job has taken to complete the run, etc. The structure of the log table is as follows:

Start time: Indicates the query kick-off time.
End time: Indicates the query execution completion time.
Run time: The time taken for the query to complete.
Status: Run status of the query (completed or failed).
Records inserted: Number of records inserted by the query.
Records updated: Number of records updated by the query.
Records deleted: Number of records deleted by the query.
Error: The reason why the query failed.

Features

Smart Restart

If a query fails during execution due to some issue, and you restart the job after fixing the issue, the job will restart from the query that failed last time instead of restarting from query 1. The intermittent table, which stores the audit logs temporarily, helps identify the failure point. This is the default functionality. There are cases where we do not want to restart from the failure point and instead want to run from the first query of that job group and destination. In this case, the entries in the intermittent table need to be flushed to override the default functionality.

Manual Control

We might come across cases where we do not want to execute all the queries under a 'Job Group' and Destination all the time. We sometimes might want to skip queries in between. In such cases, flipping the execution switch to 'No' skips that query's execution.

Query Change Detection

If a query is changed between executions, the strategy detects the change and restarts the execution from query 1 for the particular 'Job Group' and Destination values (see the sketch at the end of this article). This feature is helpful when a query fails due to duplicates/data issues (due to code) or syntax issues. In these cases, a query update is mandatory, and we don't need to worry about flushing the intermittent table to restart the job.

Automatic Log History Deletion

Log data accumulates over time and will slow down the process. This strategy auto-cleans the log data based on the number given in the 'Log_Data_Lifespan_in_Days' field. If the value is 30, then the log clean-up happens every 30 days, keeping the log table lighter.

Important Notes for Usage

Rules for Query Text field:

Double hyphens, which we generally use to comment out a piece of code in a query, must be avoided, as there is a chance those hyphens will comment out the rest of the code in addition to the intended piece.
If we are against formatting the query into a single line, we can use a backslash (\) at the end of each line.
Enclose the query in double dollar signs to avoid any misreads when encountering special characters.
Make sure the user running this execution strategy has access to the objects on which the queries operate. These are called the caller's rights.

Rules for 'Run Order' field:

Do not provide the same number twice for a particular 'Job Group' and Destination combination. If we provide the same number for this combination, the order in which the queries are executed is not guaranteed.

Rules for Parallel Execution:

Multiple individual 'Job Groups' can be run in parallel by creating separate jobs in the trigger controller section.
This strategy does not support parallel execution by default, but it can be achieved by assigning a different job group to the queries.
Inter-job-group dependency can be set while making entries by using the 'Job_Group_Order_Number' field.

Final Notes

This approach comes in handy especially in the ELT ecosystem, where we replicate the data into our data platform first and then process it. This strategy helps us make use of the full potential of the cloud platform, as the complete execution happens within the cloud-native data warehouse.

Enhancements:

We can add a field to the Vault table to provide the size of the cluster if we are expecting to handle huge amounts of data.
A dashboard can be created on top of the log table to capture job-level stats.
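As a closing illustration of the Query Change Detection feature referenced above, here is one way such detection can be implemented: fingerprint each query's text and compare fingerprints between runs. This is an assumption about a possible approach, not the author's actual stored-procedure logic.

Python

import hashlib

def query_fingerprint(query_text: str) -> str:
    # Hash the query text so any change to the query can be detected between runs.
    return hashlib.sha256(query_text.encode("utf-8")).hexdigest()

def restart_from_first_query(previous_fingerprints: list[str], current_queries: list[str]) -> bool:
    # If any query in the job group/destination changed since the last run,
    # the whole group should restart from query 1.
    return [query_fingerprint(q) for q in current_queries] != previous_fingerprints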

By Mohan Krishna Bellamkonda
*You* Can Shape Trend Reports: Join DZone's Database Systems Research

Hey, DZone Community! We have an exciting year of research ahead for our beloved Trend Reports. And once again, we are asking for your insights and expertise (anonymously if you wish) — readers just like you drive the content we cover in our Trend Reports. Check out the details for our research survey below.

Database Systems Research

With databases powering nearly every modern application, how are developers and organizations utilizing, managing, and evolving these systems — across usage, architecture, operations, security, and emerging trends like AI and real-time analytics? Take our short research survey (~10 minutes) to contribute to our upcoming Trend Report. Oh, and did we mention that anyone who takes the survey could be one of the lucky four to win an e-gift card of their choosing?

We're diving into key topics such as:

The databases and query languages developers rely on
Experiences and challenges with cloud migration
Practices and tools for data security and observability
Data processing architectures and the role of real-time analytics
Emerging approaches like vector and AI-assisted databases

Join the Database Systems Research

Over the coming month, we will compile and analyze data from hundreds of respondents; results and observations will be featured in the "Key Research Findings" of our upcoming Trend Report. Your responses help inform the narrative of our Trend Reports, so we truly cannot do this without you. Stay tuned for each report's launch and see how your insights align with the larger DZone Community. We thank you in advance for your help!

—The DZone Content and Community team

By DZone Editorial
Python Development With Asynchronous SQLite and PostgreSQL

After years of working from the comfort of Python and Django, I moved to the wild asynchronous world of FastAPI to improve latency in web-based AI applications. I started with FastAPI and built an open-source stack called FastOpp, which adds command-line and web tools similar to Django.

Initially, things went smoothly using SQLite and aiosqlite to add AsyncIO to SQLite. I used SQLAlchemy as my Object Relational Mapper (ORM) and Alembic as the database migration tool. Everything seemed to work easily, so I added a Python script to make things similar to Django's manage.py migrate. As things were going smoothly, I added Pydantic for data validation and connected Pydantic to the SQLAlchemy models with SQLModel. Although I was pulling in open source packages that I wasn't that familiar with, the packages were popular, and I didn't have problems during initial use.

Django comes with an opinionated stack of stable, time-tested tools, which I was starting to miss. However, I was really attracted to FastAPI features such as auto-documentation of APIs and the async-first philosophy. I continued forward by integrating SQLAdmin for a pre-configured web admin panel for SQLAlchemy. I also implemented FastAPIUsers. At this point, I ran into problems using FastAPIUsers in the same way I used Django authentication. I got my first glimpse of the complex world outside of Django's comprehensive ecosystem. I needed to implement my own JWT authentication and used FastAPIUsers as the hash mechanism.

The FastAPI project has a full-stack-fastapi-template that I assessed as a starting point. I chose not to use it since my primary goal was focused on using Jinja2Templates for a streaming application from an LLM with heavy database use, both SQL and eventually a vector database using pgvector with PostgreSQL, or, for simpler deployments, FAISS with SQLite and the FTS5 extension. My goal is to provide a more Django-like experience for FastAPI and provide the opportunity in the future to use the built-in API and auto-documentation of FastAPI instead of implementing something like Django REST framework, which I've found difficult to set up for automatic documentation of API endpoints.

I've considered for a long time whether it's better to just use Django with asyncio from the beginning and not build a Django-like interface around FastAPI. In Django 6, there is some support for background tasks. My primary motivation for moving to FastAPI came when I was using Django for asynchronous communication with LLM endpoints. Although Django works fine with asynchronous communication, its default synchronous communication style created a number of problems for me. For average people like me, it's difficult to keep a method asynchronous and avoid calls into libraries or communication channels, like database access, that might be synchronous. At this point, I wanted to simplify my code architecture, so I committed to FastAPI and made my code asynchronous from the beginning.

It seemed simple. I thought I just needed to use an asynchronous driver with PostgreSQL and everything would work. I was wrong.

Problems Moving to Asynchronous Database Connections

psycopg2, psycopg3, or asyncpg

The default way to connect to PostgreSQL from Python for many people is psycopg2. It is a very proven way and the default in most Django applications. Unfortunately, it is synchronous. The most common asynchronous PostgreSQL connector is asyncpg, but I couldn't get it to work in my deployment to Leapcell.
As Leapcell had a psycopg2 example for SQLAlchemy, I used psycopg2 and rewrote the database connection to be synchronous while keeping everything around the connection asynchronous. As the latency with the LLM is much higher than the latency with the database, this seemed like a reasonable solution at the time. I just had to wait for the database to send back the response, and then I was free to deal with other asynchronous problems, such as LLM query and internet search status updates. The database latency was likely going to be less than 1,500 ms for most queries, which was okay for my application.

Using a synchronous connection to the database is great in theory, and I'm sure that other, more experienced Python developers can easily solve this problem and keep the synchronous and asynchronous code nicely separated with clean use of async and await. However, I ran into problems organizing my code to use synchronous connections to the database within asynchronous methods that were talking to the LLM and storing the history in the database. As I was familiar with async/await from using Dart for many years, I was surprised I was having these problems. The problem might have been due to my lack of experience in understanding which pre-made Python modules were sending back synchronous versus asynchronous responses. I think other Python developers might understand my pain. To keep an asynchronous database connection for both SQLite and PostgreSQL, I moved from the synchronous psycopg2 to asyncpg.

SSL Security Not Needed in SQLite, But Needed in PostgreSQL Production

The asyncpg connector worked fine in development, but not in production. Although establishing an SSL network connection seems obvious, I didn't really appreciate this because I had been deploying to sites like Fly.io, Railway, and DigitalOcean Droplets with SQLite. For small prototype applications, SQLite works surprisingly well with FastAPI. I was trying to deploy to the free hobby tier of Leapcell to set up a tutorial for students who didn't want to pay or didn't want to put their credit card into a hosting service just to go through a tutorial.

There's no way to write to the project file system on the Leapcell service engine. Leapcell recommends using Object Storage and PostgreSQL for persistent data. They do offer a free tier that is pretty generous for PostgreSQL. Leapcell requires SSL communication between their PostgreSQL database and their engine, which they call the service. Unfortunately, the syntax for the SSL mode is different between psycopg2 and asyncpg. I couldn't just add ?sslmode=require to the end of the connection URL, and Leapcell did not have an example for asyncpg. Likely due to my limited skills, I wasn't able to modify my application completely enough to put the SSL connections in all the required places. In order to just use the URL connection string with sslmode=require, I decided to use psycopg3.

Prepared Statements Caused the Application to Crash With SQLAlchemy

As I needed to use an async ORM in Python, I used SQLAlchemy. I didn't have much experience with it initially. I didn't realize that even though I wasn't making prepared statements in my Python application, the communication between psycopg and PostgreSQL was storing prepared statements. Due to the way the connections were pooled on Leapcell, I had to disable the prepared statements. It took me a while to isolate the problem and then implement the fix.
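To show roughly what the working setup looks like, here is a minimal sketch using SQLAlchemy's async engine with the psycopg 3 driver and sslmode=require in the URL. The host, credentials, and query are placeholders, and the asyncpg note in the comments is a suggestion to verify against your driver versions rather than a drop-in fix.

Python

import asyncio

from sqlalchemy import text
from sqlalchemy.ext.asyncio import create_async_engine

# psycopg 3 understands libpq-style options, so SSL can be requested in the URL itself.
# Host, credentials, and database name below are placeholders.
DATABASE_URL = "postgresql+psycopg://app_user:app_password@db-host:5432/appdb?sslmode=require"

engine = create_async_engine(DATABASE_URL, pool_pre_ping=True)

# If you stay on asyncpg instead, its client-side statement cache can usually be
# turned off via connect_args={"statement_cache_size": 0}, a common workaround
# behind pooled connections; verify against your asyncpg/SQLAlchemy versions.

async def main() -> None:
    async with engine.connect() as conn:
        result = await conn.execute(text("SELECT 1"))
        print(result.scalar_one())

asyncio.run(main())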
The problem never occurred when using SQLite because SQLite runs prepared statements in the same process, using the same memory space as the Python program. This is different from PostgreSQL, where the network and session state can change. As I was worried about the performance impact of disabling prepared statements, I did some research, and it appears that SQLAlchemy does statement caching on the Python side. The real-world impact of disabling prepared statements in PostgreSQL appears to be negligible.

Summary

Using SQLite in asynchronous mode has been quite easy. Getting PostgreSQL to work has been more difficult. There were three areas that I had trouble with for PostgreSQL:

Asynchronous connection – how to write asynchronous Python code effectively to await the returned data.
Security – how to deal with both SQLite, which doesn't require SSL, and PostgreSQL in production, which does.
Prepared statements – I needed to learn to rely on SQLAlchemy's statement caching instead of the built-in prepared statements on the PostgreSQL server.

I like FastAPI, and there are many huge advantages to using it that I got in the first hour of use. I'm going to continue using it instead of Django. However, I'm starting to really appreciate how much Django shielded me from the infrastructure setup for my applications. FastAPI is unopinionated in areas such as the database, connectors, authentication, and models, and I find it difficult to gain expertise in every one of those areas. Thus, I am focusing on a smaller set of open source components that work with FastAPI to gain a deeper understanding of their use. I feel that many other Python developers are on a similar journey to experiment more with asynchronous Python web applications. I would appreciate feedback and ideas on which open source components or techniques to use to build effective asynchronous AI applications.

Resources

FastOpp – Open source stack I am building around FastAPI
FastAPI – A better Flask

By Craig Oda
Distributed Locking in Cloud-Native Applications: Ensuring Consistency Across Multiple Instances

Overview

I am sure most of us have used some kind of locking during development, or have faced issues with incorrect results in states that are difficult to reproduce. Things are not that complex when we need to manage locking within a process, or even across multiple processes on the same machine. It is also very common these days that most of us are building cloud-native applications/services, where there are multiple instances of the service[s] for high availability and load balancing. With multiple instances of a service, things become trickier when you need certain operations to be performed in a synchronized manner: it's no longer about multiple threads/processes, but multiple pods/nodes in a cloud-native environment. This blog covers that aspect in detail so that you are aware of the challenges and can weigh the available options with due diligence. The primary focus is on the concepts: identifying the purpose of locking, the challenges, the available options, and the decision-making factors.

Assumption

It is assumed that the reader is aware of race condition challenges and why it's important to take care of them in certain scenarios. Another assumption is that there are multiple instances of the application in a distributed environment, creating concurrent requests where locking is required to perform the task in the expected manner.

Use Cases/Problem Statement

There are two types of motivations for which you may require locking:

1. Efficiency

In this case, you want to prevent a situation where multiple copies of an operation/job are running. Nothing breaks if it happens, but it wastes resources; it doesn't lead to incorrect results/states in the end. For example, there is a job that copies files from one folder to another. Since multiple instances are running, all may try to perform this operation at the same time. While this may not lead to a corrupt state, there is no need to perform the operation concurrently, as the final result would be the same.

2. Correctness

In this case, if concurrent operation is allowed on the same state, it may lead to incorrect results and must be prevented. Basically, cases where a race condition may lead to an incorrect result must be prevented. For example, if you open two online banking sessions and try to transfer some money at the same time from your bank account, then locking must be in place to prevent incorrect results.

Locking levels may differ: sometimes locking is taken care of by the underlying framework, like a database, which handles locking at one level, and sometimes you need to do it explicitly because of your business workflow/use case, which is not addressed by the underlying technology/framework. For example:

Taken care of by the database itself: Update employee set daalchini_balance = daalchini_balance - order_amount where employeeid = xyz. In this case, even if you run this query concurrently, the result will be correct, as the database itself takes care of locking.
Explicit locking: Imagine a scenario where a parking sticker is to be given to an employee if not already taken and if parking stickers are available. There are two tables: one table with total stickers, and another table containing the details of each employee and their car, along with sticker details. We need to make sure that even if multiple concurrent calls come from the same employee to take the sticker, the sticker is issued only if it has not already been given. You may need to take an explicit lock to prevent a race condition when checking whether the sticker has already been taken by an employee, then issue it if it has not been given and stickers are available, i.e., a read-and-update scenario.

Notes

Please note that a lock may be required even in scenarios where there is no database; we just took a database-based example for simplicity.
Please also note that the lock needs to be efficient, as it has a cost in terms of latency/throughput, i.e.:
Lock time should be as short as possible.
Locks should be as granular as possible, i.e., don't block things that can be done concurrently. A row-level lock is preferable to a table-level lock wherever possible, and sometimes a combination works best, as in the parking sticker example: lock at the business-logic level on employeeid, and leverage the underlying database's implicit locks to update records and achieve the desired state.
Please note, our focus is not on atomicity here, so we are not discussing that part, but atomicity needs to be maintained wherever required.

Challenges

Providing locking in a distributed environment where multiple service instances run on the same or different nodes, even different zones/regions, is more challenging because the locking information needs to be highly available and fault-tolerant. Locking may be provided by an external service; you may write your own locking mechanism, either within the same service (which gets complex) or as an external locking service, or you may use a third-party service that already exists in your ecosystem, like MySQL, GCP Cloud Storage, Redis, etc. Whatever mechanism is used, it must meet all the challenges of a distributed environment. A distributed environment has some basic characteristics (from the current context's point of view), like:

There may be network delay/disconnectivity.
There may be data loss during the failover process, as replication is mostly asynchronous for performance reasons.
Any instance[s] of a service can go down at any point in time.

Some Scenarios

Imagine instance 1 of a service takes the lock from an external service and, due to some issue, can't proceed with its work for some time; meanwhile, a time-out happens, instance 2 of the service gets the lock, and both are able to change the same state, which may lead to an incorrect result. It means you have to be very careful with the time-out: if the original owner is unable to release the lock before the time-out period, it should either finish/commit its work before the time-out or roll back/undo it, but it must not commit after the time-out period. The longer the time-out, the longer other instances must wait for the lock if the holding instance crashes.
Imagine that instance 1 gets a lock, and after that, a failover happens at the locking server end, due to which lock information may be lost, as replication happens asynchronously and the locking information may not have been propagated to other nodes. As a result, instance 2 can also acquire the lock, and both may alter the same state, leading to an incorrect result.

Distributed Lock Options

RDBMS: In this case, check whether your database provider gives you an option to use a lock independent of your schema/tables; in other words, even if you are not using the database for data, you can still use the RDBMS as a distributed lock manager (DLM). For example, in MySQL, you can use a named/user-level lock to synchronize multiple threads/instances. We used it because MySQL was already part of our ecosystem, it is very simple to use, and it meets our requirements. It is a very good option, especially when locking is required for correctness. Reference: https://dev.mysql.com/doc/refman/5.7/en/locking-functions.html

Redis: You may use Redis for distributed locking as well. It is simple if you want to use it for efficiency purposes, as it's quite fast, and in the rare case that multiple instances get hold of the lock at the same time, it is fine. If you want to use it for correctness purposes, then you may need to use the Redlock algorithm/implementation. Support for the algorithm will depend on whether you are using self-hosted Redis or managed Redis, like GCP's. References:
https://redis.io/docs/reference/patterns/distributed-locks/ (Correctness)
https://developpaper.com/implementation-code-based-on-redis-distributed-lock/ (Efficiency)
https://dzone.com/articles/distributed-lock-implementation-with-redis

Google Cloud Storage: This is a very interesting way to leverage Google Cloud Storage for locking purposes. If you are using GCP as a cloud provider and Google Cloud Storage is in your ecosystem, and you don't have other options like MySQL/Redis, you may go for it. Reference: https://www.fullstaq.com/knowledge-hub/blogs/a-robust-distributed-locking-algorithm-based-on-google-cloud-storage

There are many other options, like Zookeeper, Hazelcast, etcd, Hashicorp Consul, etc.

Warning

Please note that locking has a cost associated with it; choose it wisely based upon your needs, and identify whether locking is required for efficiency or correctness.
Please use the lock as granularly as possible and keep its duration as short as possible.
It is highly recommended to do performance testing with concurrent requests that work on the same state.

Recommendations

If efficiency is required, you may go for Redis, and if correctness is required, you may go with MySQL, assuming these are already used in your ecosystem.

Summary

I hope you found this blog useful. If you are already using distributed locking in your project, do share the mechanism you used and your experience so far in the comments section.

References

https://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html
https://www.alibabacloud.com/blog/the-technical-practice-of-distributed-locks-in-a-storage-system_597141
https://davidecerbo.medium.com/everything-i-know-about-distributed-locks-2bf54de2df71
https://tanzu.vmware.com/developer/guides/spring-integration-lock/
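To make the RDBMS option above concrete, here is a minimal sketch of a MySQL named lock (GET_LOCK/RELEASE_LOCK) guarding a read-and-update section. It assumes PyMySQL and placeholder connection details; production code would need tuned timeouts, retries, and fuller error handling.

Python

import pymysql

def with_named_lock(lock_name: str, timeout_seconds: int = 10) -> None:
    # Connection details are placeholders.
    conn = pymysql.connect(host="db-host", user="app", password="secret", database="appdb")
    try:
        with conn.cursor() as cur:
            # GET_LOCK returns 1 on success, 0 on timeout, NULL on error.
            cur.execute("SELECT GET_LOCK(%s, %s)", (lock_name, timeout_seconds))
            (acquired,) = cur.fetchone()
            if acquired != 1:
                raise RuntimeError(f"could not acquire lock {lock_name!r}")
            try:
                do_critical_section(conn)  # e.g., check-and-issue the parking sticker
            finally:
                # The lock is session-scoped, so release it on the same connection.
                cur.execute("SELECT RELEASE_LOCK(%s)", (lock_name,))
    finally:
        conn.close()

def do_critical_section(conn) -> None:
    ...  # read-then-update work that must not run concurrently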

By Navin Kaushik
Senior Developers, What to Read Next?

Recently, one of my best friends, who is also one of the smartest developers I have the luck to know, asked me what book he should read next to further develop his skills. It took me some time to gather my thoughts, and the result might be useful for others, too. Spoiler alert: I could not find a single book that I would point to as the one to read as a senior developer. Instead, I summarized the books that I found good for one reason or another. As the summary also declares, this is a subjective list; feel free to agree or disagree with my choices, and feel free to leave a comment or contact me in any other way to share your thoughts.

First of all, why read books? We are in 2025 — everything important can be summarized in a 160-character-long message, and nobody has the mental capacity to consume anything longer than four seconds. Well, jokes aside, it is a valid concern that books might get outdated; following tech websites can help you stay up to date. (And for a currently hot topic, AI is a really good example of this. I am not aware of many well-written and still up-to-date books on that topic.) While this is true, I still have two reasons why people should engage in reading physical books:

Because physical books usually go much deeper into topics than shorter publications. They tend to present not only the direct results, but also the ideas, thoughts, and assumptions behind advice (i.e., how you shall write code) or engineering decisions (i.e., why a given framework's API was built in a particular way). If you want to learn more, these aspects are far more important than knowing a whole API by heart or knowing the best practices without understanding the reasoning behind them.
Because you will remember better: at least in my experience, I can remember much better which book I read something in than which blog I read it on. I can remember the font, the size of the book, and the number of pages on the left and right side; therefore, when I look for something I read on paper, I usually find it faster than something I read online. This might not apply to everyone, but according to my colleagues, it applies to most of us.

As a side note, I will not link any online shop for the books, but will provide the ISBN numbers, and you can look them up in the shop of your choice.

The Core

Let's start with my advice on two books. I know my friend has already read them, but I was unsure if he has the physical copies, too: Clean Code (ISBN 978-0132350884) and Clean Architecture (978-0134494166) by Robert C. Martin. You do not have to agree with everything in these, but I expect every single developer in my team to know the concepts and arguments listed in these two. As a side note, the second edition of Clean Code (978-0135398579) will be released soon (October 27, 2025), and I am already excited to get it. In general, I think it is a good idea to re-read these books every couple of years.

Technical Books

I do not believe that any of the following books would be 100% new and unknown to developers who have been around for a while. Still, they might contain aspects that you have not thought through until now, so they could be a good addition to your library.

I really liked the book The Software Craftsman (978-0134052502) because it places software development in a different context, discussing internal processes, interviewing, personal dedication, and other aspects.
This book will not help you with low-level coding details, but it can provide you with insights about the industry you are working in, which definitely helps you improve as a developer, too.

Get Your Hands Dirty on Clean Architecture (978-1805128373) managed to surprise me. It has a really good section about layered architecture, and it complements the Clean Architecture book really well by detailing some code-level aspects, which can be really helpful if you are not used to actually writing code according to clean architecture standards.

Kent Beck's book, Test Driven Development (978-0321146533), is simply the best book I've seen on TDD. I really liked how he demonstrated the strength of TDD in multiple examples, even writing an xUnit tool.

The next book does not strictly belong to this list, as it is not meant for senior developers with years, maybe even decades, of experience, but I find its writing style really good. Java By Comparison (978-1680502879) aims to help developers with less experience learn about best practices and how to avoid common mistakes. If you, as a senior, are asked by a junior what to read, this could be a really good pick.

Clean Agile (978-0135781869) provides insights into how R. C. Martin remembers the early days of the agile movement: how it evolved, the situations in which agile methodology fails to help, and, in general, what it was originally intended to be used for. Reading this can heavily improve your added value in agile projects, simply by giving you a better understanding of the methodology itself, and maybe you can help your team be more efficient, regardless of how strictly you follow a given ruleset.

Non-Technical Books

The books mentioned until this point were written by tech people for tech people, which is ultimately a valid approach because we should learn from each other. The following books do not fall into the same category: they are not necessarily written by technical people and are meant for more than just technical audiences. I still recommend them.

Humble Pi: When Math Goes Wrong in the Real World (978-0593084694) is a super entertaining book. You would not expect anything else if you are familiar with the author. Still, beyond being entertaining, it brings attention to responsibility, which I find really important to be aware of as a software developer. Some mistakes lead to a lot of money being lost. Some mistakes lead to people's lives ending earlier. I do not want to make all of us paranoid, but from time to time, everyone should consider what can happen if something in their code malfunctions. In the same book, I found some interesting details on topics I was not even aware of, and how complex they can be. My favorite part was about calendars: everyone knows that dealing with time zones and various ways of counting days and years can be tricky. But I was not aware of just how tricky it could be, and how much of a struggle it was when sending a message from one European city to another took weeks rather than seconds.

Have you ever felt you are Surrounded by Idiots (978-1785042188)? If you work as a developer and have never felt this, please contact me. I mean, really. I want to know where you work and how to apply. This book describes people and groups. I assume you are a person and you work in some sort of group. Therefore, this book could be useful for understanding others and the dynamics in which they are most efficient. I do not believe this book will help you resolve conflicts.
But it can help you understand the reasons for conflicts.

The last book I recommend is Thinking, Fast and Slow (978-0374533557). This one covers a lot of topics that are only marginally related to software development, if at all. Still, understanding how people decide and what the limits of rational behaviour are can help you a lot if you want to practice some self-reflection. And I believe most of us developers could practice a bit more of it.

Wrapping Up

Feel free to pick any book from this list for the reasons I mentioned or for any other reason you may have, or pick any other book you believe will help you become a better developer. My main message is simply this: please do consume high-quality sources, so that we do not fall into the same mistakes over and over again.

By Daniel Buza
Indexing Across Data Models: From Tables to Documents to Text

Every modern software application relies on a database to persist and manage its data. The choice of database technology is largely influenced by the application’s data model and its read and write throughput. For large datasets, query efficiency is critical. An inefficient query that works on a small dataset can quickly turn into a performance bottleneck when scaled to hundreds of thousands or millions of data points. While query optimization helps, it alone cannot guarantee high throughput. Factors such as data modeling, normalization, partitioning strategies, indexing, and even hardware resources all play a role in determining how quickly a system can serve reads and process writes. In this article, we will focus on indexes — one of the most powerful techniques for improving overall database performance. We’ll explore what indexes are, how they work, and how they can dramatically influence both read and write throughput in real-world systems. Indexes Imagine you are visiting a library to find a book to read. The library has organized its books into sections, e.g., science, fiction, or autobiographies. If you are interested in fiction, you head straight to that section, skipping the others. Within the fiction section, suppose the books are arranged alphabetically by title on different racks. If you are looking for a book that starts with “L,” you would go directly to the rack labeled “L” instead of scanning every shelf. This is similar to how indexes work in databases. Just like the sections in a library help you quickly find the right book, a database index helps the system quickly find the right data — without searching through everything. Imagine if the library didn’t have any sections or alphabetical arrangement. You would have to check each book one by one until you find the one you are looking for. The bigger the library, the longer it would take. In computing terms, this kind of search is called O(n) — which means the time it takes grows in direct proportion to the number of items. So, if the library doubles in size, your search time roughly doubles too! That’s why indexes are so important in databases — they help avoid this slow, manual search and make data retrieval efficient and much faster. Definition In its minimalistic form, an index can be defined as — An additional metadata, derived from primary data, serving as an indicator to efficiently locate the data. In the library analogy, the category and title-wise section serves as the metadata to quickly locate the required book. A typical index has the following characteristics: Indexes only utilize the primary data without affecting it, i.e., whether indexed or not, the primary data stays as-is.While indexes improve read, they lower the write throughput as the indexes require updates for every change in primary data.There could be multiple indexes on the same data for different ways of lookup. E.g., a book could be indexed independently based on its category, title, author, publisher, etc. The index could even be a combination of these, which is termed a composite index.Most indexes are to be chosen manually so that the application is in control instead of the tools themselves. Type of Indexes As the primary purpose of an index is to boost reads, the read (or query) pattern determines the index type or its structure for optimal results. For key-value data, a hash-based index would suffice if data is queried via its key only. 
If the query needs to match on multiple attributes of the data (i.e., the value), then indexes based on a Log-structured merge-tree (LSM) or B-tree are utilized. Moreover, if a multi-dimensional range query is required, then R-tree-based indexes are needed. Let's explore various index types and their corresponding use cases. Hash Index Hash indexes use a hash function to map keys to a hash bucket. This enables constant time, i.e., O(1), lookup for an exact match. It is best suited for key-value data lookups with unique keys. Since hashing doesn’t maintain order, hash indexes are unsuitable for any range-based queries. Typical Hash Indexes Popular tools that utilize hash-based indexes include MySQL Memory (HEAP) Engine, which defaults to hash indexes for in-memory tables; Redis, known for its core data structures that deliver microsecond-level access; and PostgreSQL, where hash indexes are an alternative to the default B-tree and are optimized for equality checks. Log-Structured Merge Tree (LSM) LSM tree indexes maintain an in-memory index structure for all writes, often a balanced tree such as a red-black tree. To provide durability, the writes are also appended sequentially to a write-ahead log. On reaching a certain threshold, this in-memory index data is flushed to disk as an immutable, sorted file called a Sorted String Table (SSTable), which acts as the core index. To serve a read, the in-memory index is consulted first, followed by the SSTables. To make lookups more efficient, Bloom filters are used to skip SSTables that definitely do not contain the requested key. Moreover, a background process periodically purges stale entries from SSTables via merge and compaction. Typical Log-Structured Merge-Tree Indexes As the LSM tree is designed for write-heavy workloads to achieve excellent write throughput, these indexes are well suited to large-scale key-value stores (e.g., Cassandra, LevelDB, RocksDB). Depending on how the data is spread across multiple SSTables, read throughput may be lower than that of other indexes. Note: Apache Lucene utilizes LSM for fuzzy text search up to a certain edit distance. B-Tree A B-tree is a balanced tree structure where data is stored in sorted order across nodes called pages. Each node may have multiple children, ensuring the tree remains shallow and lookups are fast. Searching, inserting, and deleting involve traversing from the root down to the leaf, usually in O(log n) time. A B-tree index is much like a library index — instead of scanning every book, follow the guideposts until you locate the exact book you need. Typical B-Tree Indexes A B-tree is ideal for read-heavy workloads and range queries. Thus, this is the most common index in relational databases (e.g., MySQL — InnoDB, PostgreSQL, Oracle) where predictable query performance is critical. It's imperative to note that most databases use a B+ tree variant instead, where only leaf nodes store row pointers, and internal nodes only guide navigation. This makes range scans even faster. Rectangle-Tree (R-Tree) A rectangle-tree (R-tree) is a hierarchical, balanced tree structure designed to index multi-dimensional spatial data like points, rectangles, polygons, or geographic coordinates. An R-tree can be considered the spatial equivalent of a B-tree, which handles one-dimensional data only. In an R-tree, each node represents a minimum bounding rectangle (MBR) or Bounding Box (BBOX) that encloses its children.
The root node covers the entire dataset, and internal nodes contain the MBRs pointing to child nodes. The leaf nodes store actual object data or references to it. So when a range or geospatial query is made, only the relevant bounding rectangles are traversed, thus eliminating a full traversal. R-tree search, insert, and delete operations run in O(log n), degrading to O(n) in the worst case only if bounding boxes overlap heavily. Typical Rectangle Tree Indexes R-tree is most efficient for range queries such as “all restaurants within 2 km” or nearest-neighbor searches, like “closest hospital to the current location.” PostgreSQL (with PostGIS) uses R-tree-over-GiST spatial indexes for geospatial queries. SQLite has a built-in R-tree module for geometry-heavy apps. Many GIS systems, including ArcGIS, QGIS, and other mapping platforms, rely on R-trees for fast rendering and querying. Note: Elasticsearch uses a BKD tree to support geospatial query indexing. Internally, this BKD tree is converted to an R-tree. Bitmap Index A bitmap index is a special type of database index that uses bit arrays (bitmaps) instead of traditional tree structures. Each distinct value in a column gets its own bitmap (a sequence of 1s and 0s), where each bit corresponds to a row. A 1 means the row contains the value, and a 0 means it does not. Typical Bitmap Indexes Bitmap indexes are best suited for low-cardinality columns, i.e., those with few distinct values, e.g., gender, boolean flags, marital status, region codes, etc. Moreover, any bitwise operations (AND, OR, NOT) on bitmaps are extremely efficient. Bitmap indexes are commonly used in data warehouses and read-heavy systems where queries often combine multiple filters. Oracle Database offers native bitmap indexing, while PostgreSQL supports bitmap scans (but not persistent bitmap indexes). Apache Hive and Spark SQL use bitmap-based techniques in their respective query optimization layers. Inverted Index An inverted index is a data structure that maps terms (words) to the list of documents (or locations) where they appear. In an inverted index, each unique word in the dataset is stored once. Before being stored, each word is normalized, i.e., lowercased, stemmed, or removed entirely if it is a stop word. Against the word, a list of document IDs (and sometimes positions) where that word occurs is stored. This enables a query to quickly return the documents in which the specified word is present. Typical Inverted Indexes Inverted indexes are the backbone of modern search engines. They are used in log aggregation platforms (e.g., Splunk, ELK stack) to query terabytes of machine data in near real time. Moreover, some RDBMS (like PostgreSQL’s GIN index) and NoSQL systems (Elasticsearch, Apache Lucene, Solr, etc.) also use inverted indexes for full-text search. Vector Index A vector index is a data structure designed to store and search high-dimensional vectors efficiently. Instead of exact matches (like traditional indexes), they enable similarity search — finding items, e.g., “Find me all the products in the image that look like this one,” or “Suggest songs that feel like this track.” While B-trees unlocked fast relational queries, vector indexes are unlocking fast semantic queries. Vector indexes operate on high-dimensional vector embeddings, which are mathematical representations of text, images, audio, or user behavior. At their core, vector indexes organize embeddings in a way that makes Approximate Nearest Neighbor (ANN) search efficient.
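To make this concrete, here is a minimal sketch of declaring and querying an ANN index in SQL, assuming PostgreSQL with the pgvector extension and a hypothetical items table (neither is mentioned above, and the syntax differs in other vector stores):

SQL
-- Hypothetical table with a tiny embedding column, purely for illustration.
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE items (
    id        INT PRIMARY KEY,
    embedding VECTOR(3)
);

-- HNSW index enabling approximate nearest-neighbor search by cosine distance.
CREATE INDEX items_embedding_hnsw ON items USING hnsw (embedding vector_cosine_ops);

-- "Find the five most similar items": <=> is pgvector's cosine distance operator.
SELECT id
FROM items
ORDER BY embedding <=> '[0.1, 0.2, 0.3]'
LIMIT 5;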
Typical implementations include Inverted File Index (IVF), Hierarchical Navigable Small World Graph (HNSW), and Product Quantization (PQ). Typical Vector Indexes Typical uses of vector indexes include recommendation systems, computer vision — such as image or video retrieval by “visual similarity” — and AI memory and retrieval-augmented generation (RAG), where large language models (LLMs) retrieve relevant context from vast knowledge bases. Several modern tools and databases implement vector indexing under the hood, e.g., Facebook AI Similarity Search (FAISS) — an open-source library for ANN search, and Annoy (Spotify) — optimized for recommendation systems. Secondary Index A secondary index is an additional index created on one or more non-primary-key attributes of a table. Unlike the primary index (which is tied to the primary key and usually unique), a secondary index can be built on any column(s), even those with duplicate values. Whenever data is inserted, updated, or deleted, the secondary index must also be updated — so there is a trade-off between faster reads and slower writes. Typically, secondary indexes are used to speed up queries that filter or sort on non-primary key columns. They also provide efficient access paths for multiple attributes of the same data. Secondary indexes are especially helpful in large datasets where scanning the whole table would be too slow. In traditional relational databases, PostgreSQL supports B-tree, Hash, GIN, and BRIN secondary indexes. MySQL/MariaDB has secondary indexes on InnoDB tables. Oracle and SQL Server heavily rely on secondary indexes for query optimization. Among NoSQL databases, MongoDB allows secondary indexes on fields beyond _id, as does Couchbase, which terms these “Global Secondary Indexes (GSI)” instead. In Elasticsearch, its inverted index structure serves a similar purpose for text search. Composite or Concatenated Index A composite index (also called a concatenated index or multi-column index) is an index built on two or more columns of a table, instead of just one. It stores a combined, ordered structure of multiple columns, allowing the database to quickly locate rows based on those columns together. The sequence of columns in the index is critical. E.g., an index on (last_name, first_name) can efficiently support queries filtering by last_name alone, or by both last_name and first_name — but not by first_name alone. Having individual indexes on each column is not the same; the optimizer may not combine them as efficiently as a single composite index. A composite index is typically useful when queries often filter, sort, or join on the same set of columns. Composite indexes are supported in all major relational databases. PostgreSQL uses B-tree composite indexes with advanced variations like multicolumn GiST indexes. SQL Server supports composite clustered and non-clustered indexes, with options like included columns. Oracle supports concatenated indexes with optimizer hints for fine-tuning. Even NoSQL systems like MongoDB support compound indexes, applying similar principles. Clustered Index A clustered index is a way of organizing data in a database table so that the rows are physically stored on disk in the same order as the index. Since the table’s rows can only be stored in one order, only one clustered index per table is possible. Queries that search by the clustered index key (e.g., primary key, date, or ID) are very efficient because the data is already sorted that way.
It’s usually implemented as a B-tree, where leaf nodes contain the actual data rows. A clustered index can be defined on a single column or on multiple columns (i.e., a composite clustered index). Clustered indexes are best suited for range queries since the rows are stored in order, and the database can quickly scan a contiguous block. Queries with ORDER BY or GROUP BY on the clustered column often skip extra sorting steps. Since the clustered index is often built on the primary key, fetching rows by ID is lightning fast. However, clustered indexes can slow down insert-heavy workloads (because new rows must be slotted into the correct physical position) and make updates to indexed columns more expensive. SQL Server and MySQL (InnoDB engine) always cluster data around the primary key. PostgreSQL doesn’t maintain clustered indexes automatically, but the CLUSTER command can be used to physically reorder a table. Oracle uses Index-Organized Tables (IOTs), which are essentially clustered indexes. Conclusion While most software professionals don’t need to master every internal detail of database indexes, understanding their fundamentals is highly valuable. Indexes are not just abstract data structures; they directly shape how efficiently applications handle queries, scale under load, and balance read/write performance. With this knowledge, engineers and architects can make more informed choices about which database engines or index types best suit their data models and query patterns. However, real-world performance depends on data distribution, workload, and hardware. That’s why benchmarking with actual datasets is essential before adopting or switching indexes. Thoughtful selection, validated by testing, ensures systems remain efficient, resilient, and scalable in practice. References and Further Reading Log-structured merge-tree, B-tree, GiST, PostGIS, Geographic information system, Postgres GIN, R-tree, Inverted index, Vector space model, Designing Data-Intensive Applications

By Ammar Husain
Infusing AI into Your Java Applications
Infusing AI into Your Java Applications

Artificial intelligence (AI) is becoming increasingly pervasive. As an Enterprise Java developer, you might be wondering what value AI can add to your business applications, what tools Java provides to easily do that, and what skills and knowledge you might need to learn. In this article, we equip you with the basic knowledge and skills that you need to start exploring the capabilities of AI to build intelligent and responsive Enterprise Java applications. When we talk about AI in this article, we mean getting responses from a large language model (LLM) based on a request that the Java application sends to the LLM. In our article’s example, we create a simple chatbot that customers can ask for planetary tourist destination recommendations, and then use to book a spaceship to visit them. We demonstrate using Java frameworks like LangChain4j with Quarkus to efficiently interact with LLMs and create satisfying applications for end-users. Hello (AI) World: Getting an LLM to Respond to a Prompt The first version of our spaceship rental application will build a chatbot that interacts with customers using natural language. It should answer any customer questions about planets they wish to visit in the solar system. For the full application code, see spaceship rental step-01 directory in the GitHub repository. The chatbot sends the customer's questions to the application, which interacts with the LLM to help process the natural language questions and to respond to the customer. For the AI-related parts of the application, we create just two files: An AI service, CustomerSupportAgent.java, which builds a prompt informing the LLM about our solar system’s planets and instructs the LLM to answer questions from customers.A WebSocket endpoint, ChatWebSocket.java, which receives the user’s messages from the chatbot. AI services are Java interfaces that provide a layer of abstraction. When using LangChain4j, these interfaces make LLM interaction easier. AI services are an integration point, so in a real application, you would need to consider security, observability, and fault tolerance of the connections and interactions with the LLM. As well as handling LLM connection details (stored separately in the application.properties configuration file), an AI service builds the prompts and manages chat memory for the requests it sends to the LLM. The prompt is built from two pieces of information in the AI service: the system message and the user message. System messages are typically used by developers to give the LLM contextual information and instructions for handling the request, often including examples that you want the LLM to follow when generating its response. User messages provide the LLM with application user requests. The CustomerSupportAgent interface is registered as the AI service in the application. It defines the messages used to build the prompt and sends the prompt to the LLM: Java @SessionScoped @RegisterAiService public interface CustomerSupportAgent { @SystemMessage(""" You are a friendly, but terse customer service agent for Rocket's Cosmic Cruisers, a spaceship rental shop. You answer questions from potential guests about the different planets they can visit. If asked about the planets, only use info from the fact sheet below. """ + PlanetInfo.PLANET_FACT_SHEET) String chat(String userMessage); } Let’s look at what this code is doing. 
The @SessionScoped annotation maintains the session for the duration of the WebSocket connection and maintains the chat memory for the duration of the conversation. The @RegisterAiService annotation registers an interface as an AI service. LangChain4j automatically implements the interface. The @SystemMessage annotation tells the LLM how to behave when responding to the prompt. When the end user types a message in the chatbot, the WebSocket endpoint passes the message to the chat() method in the AI service. There is no @UserMessage annotation specified in our AI service interface, so the AI service implementation automatically creates a user message with the chat() method parameter value (in this case the userMessage parameter). The AI service adds the user’s message to the system message to build a prompt that it sends to the LLM, then displays the response from the LLM in the chatbot interface. Note that, for readability, the planet information has been placed in a separate PlanetInfo class. Alternatively, you could place the planet information directly in the system message. The ChatWebSocket class defines a WebSocket endpoint for the application’s chatbot UI to interact with: Java @WebSocket(path = "/chat/batch") public class ChatWebSocket { private final CustomerSupportAgent customerSupportAgent; public ChatWebSocket(CustomerSupportAgent customerSupportAgent) { this.customerSupportAgent = customerSupportAgent; } @OnOpen public String onOpen() { return "Welcome to Rocket's Cosmic Cruisers! How can I help you today?"; } @OnTextMessage public String onTextMessage(String message) { return customerSupportAgent.chat(message); } } The ChatWebSocket class uses constructor injection to automatically obtain a reference to the CustomerSupportAgent AI service. When the end user types a message in the chatbot, the onTextMessage() method passes the message to the AI service chat() method. For example, if the user asks, "What's a good planet to visit if I want to see volcanoes?", the application responds with a recommendation and explains why a fan of volcanoes might like to visit: The Spaceship Rental application chatbot Providing an Illusion of Memory As you continue your conversation with the chatbot, it might seem as though it is aware of previous messages exchanged, that is, the context of your conversation. When you talk to another person, you take for granted that they remember what you (and they) last said. Requests to an LLM are stateless, though, so each response is generated solely based on the information contained within the request prompt. To maintain context in a conversation, the AI service uses chat memory, through LangChain4j, to store prior user messages and the chatbot’s responses. By default, the Quarkus LangChain4j extension stores the chat in memory, and the AI service manages the chat memory (for example, by dropping or summarizing the oldest messages) as needed to remain within the memory limits. LangChain4j by itself would require you to first configure a memory provider, but that is not needed when using the Quarkus LangChain4j extension. This gives a practical illusion of memory to end users and improves the user experience so they can enter follow-on messages without needing to repeat everything they previously said. The user chatbot experience can also be improved by streaming the responses from the LLM. Streaming Responses for a More Responsive User Experience You might notice that responses in your chat window take time to generate and then appear all at once.
To improve the chatbot’s perceived responsiveness, we can modify the code to return each token of the response as it is generated. This approach, called streaming, allows users to start reading a partial response before the entire response is available. For the full application code, see the GitHub spaceship rental step-02 directory. Changing our application to stream the chatbot response is easy. First, we’ll update the CustomerSupportAgent interface to add a method that returns an instance of the SmallRye Mutiny Multi<String> interface: Java @SessionScoped @RegisterAiService @SystemMessage(""" You are a friendly, but terse customer service agent for Rocket's Cosmic Cruisers, a spaceship rental shop. You answer questions from potential guests about the different planets they can visit. If asked about the planets, only use info from the fact sheet below. """ + PlanetInfo.PLANET_FACT_SHEET) public interface CustomerSupportAgent { String chat(String userMessage); Multi<String> streamChat(String userMessage); } Moving the @SystemMessage annotation to the interface means that the annotation doesn’t have to be added to each of the methods in the interface. The streamChat() method returns the LLM’s response to the chat window one token at a time (instead of waiting to display the full response all at once). We also need to call the new streamChat() method from a WebSocket endpoint. To preserve both batch and stream functionality, we create a new ChatWebSocketStream class that exposes the /chat/stream WebSocket endpoint: Java @WebSocket(path = "/chat/stream") public class ChatWebSocketStream { private final CustomerSupportAgent customerSupportAgent; public ChatWebSocketStream(CustomerSupportAgent customerSupportAgent) { this.customerSupportAgent = customerSupportAgent; } @OnOpen public String onOpen() { return "Welcome to Rocket's Cosmic Cruisers! How can I help you today?"; } @OnTextMessage public Multi<String> onStreamingTextMessage(String message) { return customerSupportAgent.streamChat(message); } } The customerSupportAgent.streamChat() call invokes the AI service to send the user message to the LLM. After making some minor tweaks to the UI, we can now toggle streaming on and off in our chatbot: The application with the new streaming platform enabled With streaming enabled, each token (each word, or part-word) produced by the LLM is immediately returned to the chat interface. Generating Structured Outputs From Unstructured Data Up to this point, the LLM’s outputs have been intended for the application’s end user. But what if, instead, we want the LLM’s output to be used directly by our application? When the LLM responds to a request, the AI service that mediates the interaction with the LLM can return structured outputs, which are formats that are more structured than a String, such as POJOs, lists of POJOs, and native types. Returning structured outputs significantly simplifies the integration of an LLM’s output with your Java code because it enforces that the output received by the application from the AI service maps to your Java object’s predefined schema. Let’s demonstrate the usefulness of structured outputs by helping the end user select a spaceship from our fleet that meets their needs. For the full application code, see the GitHub spaceship rental step-03 directory. 
We begin by creating a simple Spaceship record to store information about each individual spaceship in the fleet: Java record Spaceship(String name, int maxPassengers, boolean hasCargoBay, List<String> allowedDestinations) { } Similarly, to represent the user’s query about the spaceships in our fleet, we create a SpaceshipQuery record, which is based on the information the user provided in the chat: Java @Description("A request for a compatible spaceship") public record SpaceshipQuery(int passengers, boolean hasCargo, List<String> destinations) { } The Fleet class populates several Spaceship objects and provides a way to filter out those that do not match the user. Next, we update the CustomerSupportAgent interface to take the user’s message (unstructured text) to create a structured output in the form of the SpaceshipQuery record. To accomplish this feat, we only need to set the return type for a new extractSpaceshipAttributes() method in our AI service to be a SpaceshipQuery: Java SpaceshipQuery extractSpaceshipAttributes(String userMessage); Under the covers, LangChain4j automatically generates a request to the LLM including a JSON schema representation of the desired response. LangChain4j deserializes the JSON-formatted response from the LLM and uses it to return a SpaceshipQuery record, as requested. We also need to know whether the user’s input is about one of our spaceships, or about some other topic. This filtering is accomplished using a simpler, structured output request that returns a boolean: Java @SystemMessage(""" You are a friendly, but terse customer service agent for Rocket's Cosmic Cruisers, a spaceship rental shop. Respond with 'true' if the user message is regarding spaceships in our rental fleet, and 'false' otherwise. """) boolean isSpaceshipQuery(String userMessage); Our last addition to the CustomerSupportAgent interface enables the agent to provide a spaceship suggestion based on our fleet and the user’s request, with and without streaming: Java @UserMessage(""" Given the user's query regarding available spaceships for a trip {message}, provide a well-formed, clear and concise response listing our applicable spaceships. Only use the spaceship fleet data from {compatibleSpaceships} for your response. """) String suggestSpaceships(String message, List<Spaceship> compatibleSpaceships); @UserMessage(""" Given the user's query regarding available spaceships for a trip {message}, provide a well-formed, clear and concise response listing our applicable spaceships. Only use the spaceship fleet data from {compatibleSpaceships} for your response. """) Multi<String> streamSuggestSpaceships(String message, List<Spaceship> compatibleSpaceships); } Our last step is to update the ChatWebSocket and ChatWebSocketStream classes to first check if the user’s query is about spaceships in our fleet. If so, the customer support agent creates a SpaceshipQuery record by extracting the information from the user’s message and then responds with suggested spaceships from the fleet that are compatible with the user’s request. 
The updated code is similar for both the ChatWebSocket and ChatWebSocketStream classes, so only the ChatWebSocket class is shown here: Java @OnTextMessage public String onTextMessage(String message) { boolean isSpaceshipQuery = customerSupportAgent.isSpaceshipQuery(message); if (isSpaceshipQuery) { SpaceshipQuery userQuery = customerSupportAgent.extractSpaceshipAttributes(message); List<Spaceship> spaceships = Fleet.findCompatibleSpaceships(userQuery); return customerSupportAgent.suggestSpaceships(message, spaceships); } else return customerSupportAgent.chat(message); } With these updates, the customer support agent is ready to use the structured outputs to provide the user with spaceship suggestions: The application providing the user with spaceship suggestions based on the structured output With that, we have completed an AI-infused Java chatbot application that provides planetary tourism recommendations and spaceship rentals. To continue learning, experiment with the full code of our sample application alongside the Quarkus with LangChain4j docs. More on These AI Concepts We’ve discussed various AI concepts throughout this article. If you want to know more about any of them, here is a quick explainer. Large Language Models (LLMs) When we talk about AI in this article, we generally mean getting responses from a large language model. LLMs are machine learning models that are trained to generate a sequence of outputs based on a sequence of inputs (often text inputs and outputs, but some multi-modal LLMs can work with images, audio, or video). LLMs can perform a wide variety of tasks, such as summarizing a document, translating between languages, fact extraction, writing code, etc. This task of creating new content from the input is what’s referred to as Generative AI, or GenAI. You can infuse such capabilities into your application as needed. Making Requests to LLMs: Prompts, Chat Memory, and Tokens How you request information from an LLM influences not only the response you get back from the LLM but also the end user’s experience and the application's running costs. Prompts Sending a request to an LLM, whether from application code or as an end-user in a chat interface, involves writing a prompt. A prompt is the information (usually, but not always, text) to which the LLM responds. If you think of communicating with an LLM like communicating with another person, how you phrase your request is important to making sure the other person (or the LLM, in this case) understands what you want to know. For example, ensuring that you give the context of the request before going on to ask for a specific piece of information, and not providing lots of irrelevant information to confuse the listener. Chat Memory Unlike when you are talking to another person, LLMs are stateless and don’t remember the previous request, so everything you need the LLM to take into consideration needs to be in your request: the prompt, any previous requests and responses (the chat memory), and any tools you provide to help the LLM respond. However, providing too much information to the LLM in the prompt can potentially complicate the request. It can also be costly. Tokens LLMs convert the words in your prompt into a sequence of tokens. Most hosted LLMs charge usage based on the number of tokens in the request and response. A token can represent a whole word or a part of a word. For example, the word "unbelievable" is typically split into multiple tokens: "un", "bel", and "ievable". 
The more tokens that you include in the request, especially when you include all the chat memory, the greater the potential cost of running the application. Providing all the chat memory in a request can make requests both costly and less clear. Requests to LLMs are limited in length, so it’s important to manage the chat memory and how much information is included in the request. This can be helped a lot by the Java frameworks that you use, such as LangChain4j with Quarkus, which we use for the sample application in this article. LangChain4j and Quarkus Frameworks LangChain4j is an open-source Java framework that manages interactions between Java applications and LLMs. For example, LangChain4j, through the concept of AI services, stores and helps you to manage chat memory, so that you can keep requests to the LLM efficient, focused, and less expensive. Quarkus is a modern, cloud-native, open-source Java framework optimized for developer productivity, running in containerized environments, and with fast startup and low memory usage. The LangChain4j extensions to Quarkus simplify the configuration of connecting to and interacting with LLMs in AI-infused Java applications. The LangChain4j project can be used with other Java application frameworks, including Open Liberty, Spring Boot, and Micronaut. MicroProfile and Jakarta EE are also working together with LangChain4j to provide an open standards-based programming model for developing AI applications. The Sample Application You can find the complete sample application that we demonstrate throughout this article on GitHub. The application is written in Java and runs on Quarkus using the Quarkus LangChain4j extensions. Conclusion Infusing AI into Java applications enhances the application’s capabilities and the end-user's experience. With the help of Java frameworks like Quarkus and LangChain4j to simplify interactions with LLMs, Java developers can easily infuse AI into business applications. Writing AI-infused applications in Java means you're working in Java’s robust, enterprise-ready ecosystem, which not only helps you to easily interact with AI models, but also makes it easy for the applications to benefit from enterprise essentials such as performance, security, observability, and testing. The field of AI is rapidly evolving. By mastering the concepts and technologies in this article, you can stay ahead of the curve and start exploring how AI can help you build intelligent and engaging Java applications. Experiment with the full code of our sample application alongside the Quarkus with LangChain4j docs. If you’d like to learn more, try this tutorial on how to extend the knowledge of the LLM with content from PDF documents by using retrieval-augmented generation (RAG): Build an AI-powered document assistant with Quarkus and LangChain4j. Thanks to Red Hatters Clement Escoffier, Markus Eisele, and Georgios Andrianakis for valuable review comments. This article was originally published on InfoQ on Aug 15, 2025.

By Don Bourne
Where Stale Data Hides Inside Your Architecture (and How to Spot It)
Where Stale Data Hides Inside Your Architecture (and How to Spot It)

Every system collects stale data over time — that part is obvious. What’s less obvious is how much of it your platform will accumulate and, more importantly, whether it builds up in places it never should. That’s no longer just an operational issue but an architectural one. In my experience, I’ve often found stale data hiding in corners nobody thinks about. On the surface, they look harmless, but over time, they start shaping system behavior in ways that are hard to ignore. And it’s not just a rare edge case: studies show that, on average, more than half of all organizational data ends up stale. That means the risks are not occasional but systemic, quietly spreading across critical parts of the platform. The impact isn’t limited to performance. Outdated records interfere with correctness, break consistency across services, and complicate debugging. What is more, stale data quietly consumes storage and processing resources, increasing operational costs. Based on what I’ve seen in enterprise platforms, I can point to several hidden spots that deserve far more attention than they usually get. Where Stale Data Finds Room to Hide My team often joins enterprise projects with goals like improving performance or reducing costs. Each time, the same lesson surfaces: by examining the spots below, platforms become leaner, faster, and far easier to maintain. Cache Layers as Hidden Conflict Zones Stale data often hides not in caching itself but in the gaps between cache layers. When application, storefront, and CDN caches don’t align, the system starts serving conflicting versions of the truth, like outdated prices or mismatched product images. In one enterprise ecommerce platform, we traced product inconsistencies back to five overlapping cache levels that overwrote each other unpredictably — a classic case of caching mistakes. The fix required reproducing the conflicts with architects and tightening configurations. A clear warning sign that your cache may hide stale data is when problems vanish after cache purges, only to return later. It often means the layers are competing rather than cooperating. Synchronization Jobs That Drift Another source of stale data is asynchronous synchronization. On paper, delayed updates look harmless, as background jobs will “catch up later.” In practice, those delays create a silent drift between systems. For example, users of a jewelry platform saw outdated loyalty points after login because updates were queued asynchronously. Customers assumed their balances had disappeared, support calls surged, and debugging became guesswork. The issue was fixed by forcing a back-end check whenever personal data pages were opened. A common signal is when user-facing data only appears correct after manual refreshes or additional interactions. Historical Transaction Data That Never Leaves One of the heaviest anchors for enterprise systems is transactional history that stays in production far longer than it should. Databases are built to serve current workloads, not to carry the full weight of years of completed orders and returns. This is exactly what my team encountered in a European beauty retail platform: the production database had accumulated years of records, slowing queries, bloating indexes, and dragging overnight batch jobs while costs crept higher. The fix was smart archiving: moving old records out of production and deleting them once the retention periods expired.
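In SQL terms, the core of such an archiving job can be as simple as the sketch below. The table names, status values, and retention windows are illustrative assumptions rather than details of that project, the date arithmetic is PostgreSQL-style, and in production the steps would run in scheduled, transactional batches:

SQL
-- Illustrative only: copy completed orders older than the active window into an archive table...
INSERT INTO orders_archive
SELECT *
FROM orders
WHERE status = 'COMPLETED'
  AND completed_at < CURRENT_DATE - INTERVAL '2 years';

-- ...then remove them from the production table...
DELETE FROM orders
WHERE status = 'COMPLETED'
  AND completed_at < CURRENT_DATE - INTERVAL '2 years';

-- ...and purge archived rows once the legal retention period has expired.
DELETE FROM orders_archive
WHERE completed_at < CURRENT_DATE - INTERVAL '7 years';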
A telling signal is when routine reports or nightly jobs begin stretching into business hours without clear functional changes. Legacy Integrations as Silent Data Carriers Integrations with legacy systems often look stable because they “just work.” The trouble is that over time, those connections become blind spots. Data is passed along through brittle transformations, copied into staging tables, or synchronized with outdated protocols. At first, the mismatches are too small to notice, but they slowly build into systemic inconsistencies that are painful to trace. A signal worth watching is when integrations are left undocumented, or when no one on the team can explain why a particular sync job still runs. That usually means it’s carrying stale data along with it. Backups With Hidden Liabilities Backups are the one place everyone assumes data is safe. The paradox is that safety can turn into fragility when outdated snapshots linger for years. Restoring them may quietly inject obsolete records back into production or test systems, undermining consistency at the very moment resilience is needed most. The architectural pain is in rising storage costs and the risk of corrupted recovery. A simple indicator is when backup retention policies are unclear or unlimited. If “keep everything forever” is the default, stale data has already found its way into your disaster recovery plan. When seeing the corners where stale data tends to accumulate, the next question is, how do you tell when it’s quietly active in yours? Spotting the Signals of Stale Data Over the years, I’ve learned to watch for patterns like these: Lagging reality: Dashboards or analytics that consistently trail behind real events, even when pipelines look healthy.Phantom bugs: Issues that disappear after retries or re-deployments, only to return without code changes.Inconsistent truths: Two systems show different values for the same entity — prices, stock, balances — without a clear root cause.Process creep: Batch jobs or syncs that take longer every month, even when business volume hasn’t grown at the same pace.Operational tells: Teams relying on manual purges, ad-hoc scripts, or “refresh and check again” advice as standard troubleshooting steps. Signals spotted, hiding places uncovered — the next question is obvious: what do you actually do about it? Here is some practical advice. Keeping Data Fresh by Design Preventing stale data requires making freshness an architectural principle. It often starts with centralized cache management, because without a single policy for invalidation and refresh, caches across layers will drift apart. From there, real-time synchronization becomes critical, as relying on overnight jobs or delayed pipelines almost guarantees inconsistencies will creep in. But even when data moves in real time, correctness can’t be assumed. Automated quality checks, from anomaly detection to schema validation, are what keep silent errors from spreading across systems. And finally, no system operates in isolation. Imports and exports from external sources need fail-safes: guardrails that reject corrupted or outdated feeds before they poison downstream processes. Taken together, these practices shift data freshness from reactive firefighting to proactive governance, ensuring systems stay fast, consistent, and trustworthy. Fresh Data as an Ongoing Architectural Discipline In my experience, the cost of stale data rarely hits all at once — it creeps in. 
Performance slows a little, compliance checks get harder, and customer trust erodes one mismatch at a time. That’s why I see data freshness not as a cleanup task but as an ongoing architectural discipline. The good news is you don’t need to fix everything at once. Start by asking where stale data is most visible in your system today and treat that as your entry point to building resilience.

By Andreas Kozachenko
CNCF Triggers a Platform Parity Breakthrough for Arm64 and x86
CNCF Triggers a Platform Parity Breakthrough for Arm64 and x86

The Challenge Developing open-source software for deployment on Arm64 architecture requires a robust continuous integration and continuous deployment (CI/CD) environment. Yet, there has historically been a disparity between the levels of support for Arm64 and traditional x86 processor architectures, with Arm64 usually at a disadvantage. Developers of infrastructure components for multiple architectures have certain expectations of their work environments: Consistency of the tools and methods they use across platforms, so they don’t have to adopt different development procedures just to adopt a less prevalent platform.Performance from their platforms and support mechanisms, so their deployment schemes don’t suffer from speed deficiency when they choose to support multiple platforms.Testing coverage so the very same tests for efficiency, compliance, and security apply to all platforms simultaneously and without substantial differentiation.Maintainability, enabling developers to automate their integration and redevelopment processes so they apply to all platforms without alteration. Product managers for these same components have these same requirements, plus at least two more: Platform coverage capability, so that technical account managers (TAM) may have the skills and readiness they need to respond to customer needs.Support tiering capability, enabling TAM and other IT personnel to classify their levels of software support according to their capability to respond to urgent or emerging customer issues. The Solution Working in collaboration with both Ampere and infrastructure provider Equinix, open-source developer Alex Ellis made available his Actuated CI/CD platform to some of the most critical open-source projects in the cloud-native software ecosystem. Actuated takes GitHub self-hosted automation processes demonstrated by security engineers to be inherently vulnerable to malicious attack, and runs them in microVMs abstracted from the public Internet. Implementation Several key open-source Cloud Native Computing Foundation projects took advantage of an Actuated environment to run all of their GitHub Actions for Arm64. This environment is based on Ampere® Altra® processors made available with the help of infrastructure provider Equinix. The success of this initiative was instrumental in prompting GitHub to implement full support of the Arm64 architecture with GitHub Actions. Now, developers who had been running Arm64 build processes in QEMU emulation environments on x86 architectures can relocate those processes to Arm64 on bare metal. Self-Hosted Runners for GitHub Actions on ARM64 GitHub dominates the hosting of software projects these days. The most popular way that GitHub-hosted projects generate builds and releases for continuous integration is with the platform’s built-in CI toolset, GitHub Actions. The most important role played by the GitHub Actions CI/CD platform is automating software development pipelines. The party responsible for triggering any GitHub Action is a runner. It’s an agent running on a server, waiting for something to do and eager to do it once it’s given the assignment. It’s assigned a job from the workflow and tasked with getting it done. GitHub is a complete software deployment platform. As such, it hosts its own runners, each of which is adapted to its specified target environment and architecture. Until recently, GitHub did not offer hosted runner environments for Arm64. 
Projects that wanted to generate Arm64-native builds did have an option — the self-hosted runner. GitHub users could install an agent on a physical or virtual machine hosted elsewhere, and have GitHub Actions dispatch jobs to that host, managed by the project users. This required project administrators not only to manage the project itself but also to take care of the maintenance and security of the build environment that the projects would use. In CNCF’s case, developers took advantage of credits to Equinix Metal, enabling them to provision bare metal instances and use them as self-hosted runners for projects. But for a code lab whose projects must be made available 24/7/365 to other developers worldwide, the security of self-hosted runners poses a challenge: Anyone could clone the project repository, modify the Actions jobs, and get access to the runner node to run arbitrary jobs, according to this GitHub documentation. Another problem was ensuring consistency between CI runs. With self-hosted runners, if there were side effects of the CI jobs, such as configuration changes or files left behind afterwards, they would still be there for ensuing jobs. This posed a problem — when running a CI job to build or test software, you should have a controlled environment, so that the only thing that changes between runs is the software. In the case of self-hosted runners, the environment can drift over time. In the absence of a cleanup process, it was possible for runs of the same build job on the same host to generate different results over time. One way developers bypassed the need for Arm64 native runners was by running virtual Arm64 environments on x86 servers, using QEMU open-source emulation. Emulated environments add a huge performance overhead for software compilations, which run at a fraction of the pace of compilations on native, non-emulated hardware. Emulation worked well enough for developing small to medium projects. But if developers had to build something big and important for ARM64, the strain would become so great on their virtual environments that builds would completely fail. “In the past, people were doing builds using QEMU,” said Equinix’s Developer Partner Manager Ed Vielmetti. “Say you were building a compiler, where the intermediate steps require large amounts of memory and very deep integration with the processor. That just would not work in an emulated environment.” The Disparity Phenomenon Unlike the typical enterprise, the Cloud Native Computing Foundation has a special obligation to build its cloud-native components for all the world’s major processor architectures. Projects such as the containerd portable container runtime, the etcd key/value data store, the fluentd log data collector, the Falco real-time threat detection tool, and the OpenTelemetry observability and instrumentation toolkit, among dozens of others, are critical dependencies for the cloud-native ecosystem, and as such, must be built for both x86 and Arm64. To build low-level infrastructure components with support for Arm64, CNCF developers need access to native Arm64 infrastructure. This means, ironically, they need the very class of tools they’re trying to create. At first, Ampere and Equinix collaborated with CNCF to address these gaps by donating Ampere Altra-based servers or setting up Altra-based bare metal nodes at Equinix facilities. The granularity of the Arm64-based server resources that Equinix could share was bare metal nodes — a 160-core dual-socket Ampere Altra system. 
Ideally, a server like this would be shared among several projects, but this was, at the time, beyond the capabilities of the CNCF. This is the problem that Ampere and Actuated proposed to solve for CNCF by allowing multiple projects to run on fewer hosts, thus providing easy access to build services for more projects while consuming less hardware. “OpenTelemetry is a full-on, full-time-on, CI/CD system,” said Antoine Toulmé, Senior Engineering Manager for Blockchain and DLT and Splunk Maintainer for OpenTelemetry project. “We were able to leverage [our Ampere server] infrastructure for ourselves, but we weren’t able to share it with open source at large." “We cannot give GitHub runners away,” Toulmé said. “Once we were happy with certifying the downstream distributions to our customers, we opened issues with the OpenTelemetry project saying we would like to see ARM64 support being delivered at the highest level — meaning, it should run for every commit, it should run for main, it should run all the time. And the feedback was, well, great, but there are no ARM64 runners in GitHub. So we’re going to need you to work with what we can do here.” Due to the lack of readily available Arm64 platforms for these projects, developers were unaware if the changes they committed were causing issues on Arm64, as test suites were not run as frequently as for x86. Since container orchestration platforms are among the platforms being developed to support Arm64, this phenomenon became a vicious cycle: Releases were gated on passing integration test suites for x86, but releases were not gated on the same test suites passing for Arm64. The solution CNCF’s developers would discover falls far short of qualifying as radical or revolutionary — in fact, it’s more of a bug fix in practice. It’s so simple to implement that it completely compensates for this disparity, not just for CNCF but for any developer of any platform-level component for any architecture. Breakthrough: Actuated, Plus Editing One Line of Code To take the first step towards platform parity between x86 and Arm64, Ampere enlisted the help of Alex Ellis, the creator of a service called Actuated. It’s a product that runs GitHub Actions jobs in secure, isolated microVMs, instrumented to receive build jobs from GitHub Actions, and offering developers visibility into the performance of their build jobs and the load on the shared build systems. Actuated could run all the CNCF’s existing GitHub Actions runners after altering a single line of their configuration files, plus in some cases the pasting of a few code snippets — changes which took less than five minutes to implement. These changes enabled GitHub-hosted projects to point to Actuated’s microVM-driven environment on Ampere Altra processors for their build jobs. “Falco really needed Arm64 GitHub runners to elevate its support for the architecture and enlarge its user base,” Falco project Senior Open Source Engineer and Sysdig Maintainer Federico Di Pierro said. “[Actuated] was the perfect solution for us because it was easy to leverage and relieved any burden for the maintainers. This way, we as maintainers can focus on what really matters for the project, instead of fighting with maintaining and deploying self-hosted infrastructure. 
Now we are building, testing, and releasing artifacts for ARM64, leveraging Actuated for many of our projects, and it works flawlessly.” Having seen the increase in demand for Arm native build environments in recent years, GitHub announced last June the availability in public beta of Arm64-based hosted runners for GitHub Actions, powered by Ampere compute instances on Microsoft Azure, followed in January 2025 by the release into public preview of free hosted runners for public repositories. For OpenTelemetry, this means the end of network loads as high as 10 times their assigned bandwidth caps, on account of OpenTelemetry builds constantly downloading dependencies from Docker Hub repositories. “Yeah, we were definitely breaking things,” the OpenTelemetry Project’s Antoine Toulmé said. “We got lucky, because the Arm runners for GitHub shipped. We have moved to ARM runners, we are happy as can be, and nothing is breaking anymore.” Now for the first time, project maintainers can pay as close attention to the safety and security of Arm64 builds as they have for x86 builds, knowing that they’re no longer likely to encounter performance degradations or penalties. “[Actuated] gave us great confidence in the CI builds on ARM64,” Principal Software Engineer and AWS Maintainer for the containerd project Phil Estes said. “If the Arm CI breaks now, there’s no way we will merge that [pull request] until we figure out why... We have full confidence now that [build failures] are not an issue with flaky hardware [as they sometimes were before].” For its part, Oracle is continuing its policy of donating $3 million per year in OCI credits for Arm64 instances powered by Ampere to CNCF projects. This generosity, along with the newfound stability of Arm64 platforms catalyzed by Ampere and Equinix, and brought about by Actuated, is enabling prominent cloud infrastructure vendors, including Red Hat, SUSE, Canonical, and Mirantis, to provide full support for their enterprise customers who choose ARM64 infrastructure. Parity makes it possible for enterprises to make sensible choices about their computing infrastructure and platforms without incurring penalties just for choosing an alternative architecture. Large cloud customers are proving that Arm64 can provide organizations with the performance they need and reduced expenses for workloads — all with industry-leading energy efficiency. But organizations can’t experience those benefits until they can deploy their workloads on all infrastructure options on a level playing field with one another and measure the results for themselves. Leveling the Playing Field In early 2023, few options existed for GitHub-hosted projects that wanted to fully integrate Arm64 into their continuous integration processes. Through this initiative, leveraging an innovative software solution from Actuated with Ampere CPUs hosted by Equinix, we lowered the bar for CNCF projects to make a start towards parity of support for ARM64 and x86. Key cloud-native projects, including etcd, containerd, Open Telemetry, Falco, and others, were able to advance their support of Arm64, accelerate their CI runs on native Arm64 infrastructure, and support increasing numbers of their users taking advantage of ARM64 compute in the cloud. By the end of this pilot project, the number of options for developers has grown considerably. 
The CNCF now offers its projects the ability to run GitHub Actions jobs on managed Kubernetes clusters on OCI, using Ampere-powered instances and the GitHub project Actions Runner Controller, and with the addition of hosted Arm64 runners to GitHub, it has never been easier for projects to easily support this fast-growing and exciting architecture for cloud-native applications. Check out the full Ampere article collection here.

By Scott Fulton III
Advanced Snowflake SQL for Data Engineering Analytics
Advanced Snowflake SQL for Data Engineering Analytics

Snowflake is a cloud-native data platform known for its scalability, security, and excellent SQL engine, making it ideal for modern analytics workloads. In this article, I take a deep dive into advanced SQL queries for online retail analytics, using Snowflake’s capabilities to derive insights for trend analysis, customer segmentation, and user journey mapping. The article presents seven practical queries, each with a query flow, a BI visualization, a system architecture diagram, and sample inputs/outputs based on a sample online retail dataset. Why Snowflake? Snowflake’s architecture separates compute and storage, enabling elastic scaling for large datasets. It supports semi-structured data (e.g., JSON, Avro) via native parsing, integrates with APIs, and offers features like time travel, row-level security, and zero-copy cloning for compliance and efficiency. These qualities make it a powerhouse for online retail analytics, from tracking seasonal trends to analyzing customer behavior. Scenario Context The examples below use a pseudo online retail platform, "ShopSphere," which tracks customer interactions (logins, purchases) and transaction values. The dataset includes two tables: event_log: Records user events (e.g., event_id, event_type, event_date, event_value, region, user_id, event_data for JSON). user: Stores user details (e.g., user_id, first_name, last_name). The queries are framed in a relatable business scenario, with sample data reflecting varied transaction amounts and regional differences. All sample data is synthetic, designed to demo query logic in an online retail setting. Getting Started With Snowflake To follow along, create a Snowflake database and load the sample tables. Below is the SQL to set up the event_log and user tables: SQL CREATE TABLE event_log ( event_id INT, event_type STRING, event_date DATE, event_value DECIMAL(10,2), region STRING, user_id INT, event_data VARIANT ); CREATE TABLE user ( user_id INT PRIMARY KEY, first_name STRING NOT NULL, last_name STRING NOT NULL ); Insert the sample data provided in each query section. Use a small virtual warehouse (X-Small) for testing, and ensure your role has appropriate permissions. For JSON queries, enable semi-structured data support by storing JSON in the event_data column. Advanced SQL Queries Below are seven advanced SQL queries showcasing Snowflake’s strengths, each with a query flow diagram, sample input/output, and Snowflake-specific enhancements. These queries build progressively, from basic aggregations to complex user journey analysis and JSON parsing, ensuring a logical flow for analyzing ShopSphere’s data. 1. Grouping Data by Year and Quarter This query aggregates events by year and quarter to analyze seasonal trends, critical for inventory planning or marketing campaigns. Query: SQL SELECT EXTRACT(YEAR FROM event_date) AS year, EXTRACT(QUARTER FROM event_date) AS quarter, COUNT(*) AS event_count, SUM(event_value) AS total_value FROM event_log GROUP BY year, quarter ORDER BY year, quarter; Explanation: The query extracts the year and quarter from event_date, counts events, and sums transaction values per group. Snowflake’s columnar storage optimizes grouping operations, even for large datasets. Snowflake Enhancements Scalability: Handles millions of rows with auto-scaling compute. Search optimization: Use search optimization on event_date to boost performance for frequent queries. Clustering: Cluster on event_date for faster aggregations.
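As a quick illustration of the last two enhancements, the sketch below defines a clustering key and enables the search optimization service on event_log. This is a minimal sketch rather than a prescription: search optimization requires Snowflake Enterprise Edition or higher, and both features add maintenance costs that should be weighed against the query savings.

SQL
-- Organize micro-partitions by event_date so year/quarter aggregations prune well.
ALTER TABLE event_log CLUSTER BY (event_date);

-- Enable the search optimization service to speed up selective lookups on the table.
ALTER TABLE event_log ADD SEARCH OPTIMIZATION;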
Sample input: The event_log table represents ShopSphere’s customer interactions in 2023.

event_id  event_type  event_date   event_value  region  user_id
1         Login       2023-01-15   0.00         US      101
2         Purchase    2023-02-20   99.99        EU      102
3         Login       2023-03-25   0.00         Asia    103
4         Purchase    2023-04-10   149.50       US      101
5         Login       2023-05-05   0.00         EU      102
6         Purchase    2023-06-15   75.25        Asia    103

Sample output:

year  quarter  event_count  total_value
2023  1        3            99.99
2023  2        3            224.75

BI tool visualization: A bar chart of event counts by quarter, highlighting seasonal patterns.

Query flow:

2. Calculating Running Totals for Purchases

Running totals track cumulative transaction values, useful for monitoring sales trends or detecting anomalies.

Query:

SQL
SELECT
    event_type,
    event_date,
    event_value,
    SUM(event_value) OVER (PARTITION BY event_type ORDER BY event_date) AS running_total
FROM event_log
WHERE event_type = 'Purchase'
  AND event_date BETWEEN '2023-01-01' AND '2023-06-30';

Explanation: This query calculates cumulative purchase values ordered by date, building on Query 1’s aggregation by focusing on purchases. Snowflake’s window functions ensure efficient processing.

Snowflake Enhancements

Window functions: Optimized for high-performance analytics.
Time travel: Use AT (OFFSET => -30) to query the table as it existed 30 seconds ago; the offset is expressed in seconds, and timestamps can reach further back.
Zero-copy cloning: Test queries on cloned tables without duplicating storage.

Sample input (subset of event_log for purchases in 2023):

event_id  event_type  event_date   event_value
2         Purchase    2023-02-20   99.99
4         Purchase    2023-04-10   149.50
6         Purchase    2023-06-15   75.25

Sample output:

event_type  event_date   event_value  running_total
Purchase    2023-02-20   99.99        99.99
Purchase    2023-04-10   149.50       249.49
Purchase    2023-06-15   75.25        324.74

BI visualization: The running total of purchase values over time, illustrating sales growth.

Query flow:

3. Computing Moving Averages for Login Frequency

Moving averages smooth out fluctuations in login events, aiding user engagement analysis and complementing the purchase trends from Query 2.

Query:

SQL
SELECT
    event_date,
    COUNT(*) AS login_count,
    AVG(COUNT(*)) OVER (ORDER BY event_date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS three_day_avg
FROM event_log
WHERE event_type = 'Login'
GROUP BY event_date;

Explanation: This query calculates a three-day moving average of daily login counts. The window frame ensures the average covers the current day and the two prior days.

Snowflake Enhancements

Window frames: Efficiently process sliding windows.
Materialized views: Precompute aggregates for faster reporting (a sketch appears at the end of this query’s section).
Data sharing: Share results securely with marketing teams.

Sample input (daily logins over four consecutive days):

event_id  event_type  event_date
1         Login       2023-01-15
3         Login       2023-01-16
5         Login       2023-01-17
7         Login       2023-01-18

Sample output:

event_date   login_count  three_day_avg
2023-01-15   1            1.00
2023-01-16   1            1.00
2023-01-17   1            1.00
2023-01-18   1            1.00

BI visualization: Displays the three-day moving average of login counts, showing whether daily fluctuations exist.

Query flow:
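To make the materialized-view enhancement from Query 3 concrete, here is a minimal sketch. It assumes an Enterprise Edition account (materialized views are not available in Standard Edition), and because Snowflake materialized views cannot contain window functions, the view only precomputes the daily login counts; the moving average is layered on top at query time. The view name daily_logins_mv is illustrative.

SQL
-- Precompute daily login counts (materialized views require Enterprise Edition
-- and cannot contain window functions, so the moving average stays outside the view).
CREATE MATERIALIZED VIEW daily_logins_mv AS
SELECT event_date, COUNT(*) AS login_count
FROM event_log
WHERE event_type = 'Login'
GROUP BY event_date;

-- Compute the three-day moving average over the precomputed counts.
SELECT
    event_date,
    login_count,
    AVG(login_count) OVER (ORDER BY event_date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS three_day_avg
FROM daily_logins_mv;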
4. Time Series Analysis for Regional Purchases

This query detects changes in purchase values by region over time, building on Query 2 to identify market-specific trends.

Query:

SQL
SELECT
    event_date,
    region,
    event_value,
    event_value - LAG(event_value, 1) OVER (PARTITION BY region ORDER BY event_date) AS daily_difference
FROM event_log
WHERE event_type = 'Purchase'
  AND region = 'US';

Explanation: The LAG function retrieves the previous purchase value for the region (ordered by event_date), enabling difference calculations for the US region.

Snowflake Enhancements

Clustering: Cluster on region and event_date for faster queries.
Query acceleration: Use Snowflake’s query acceleration service for large datasets.
JSON support: Parse semi-structured data with FLATTEN for enriched analysis.

Sample input (subset of event_log for US purchases):

event_date   region  event_value
2023-02-20   US      99.99
2023-04-10   US      149.50

Sample output:

event_date   region  event_value  daily_difference
2023-02-20   US      99.99        NULL
2023-04-10   US      149.50       49.51

BI visualization: The differences in purchase values for the US region over time, showing fluctuations.

Query flow:

5. Generating Hierarchical Subtotals With ROLLUP

ROLLUP creates subtotals for reporting, extending Query 1’s aggregations into financial summaries across years and regions.

Query:

SQL
SELECT
    EXTRACT(YEAR FROM event_date) AS year,
    region,
    SUM(event_value) AS total_value
FROM event_log
WHERE event_type = 'Purchase'
GROUP BY ROLLUP (year, region)
ORDER BY year, region;

Explanation: ROLLUP generates subtotals for each year and region, with NULL indicating higher-level aggregations (e.g., the total per year or the grand total).

Snowflake Enhancements

Materialized views: Precompute results for faster dashboards.
Dynamic warehouses: Scale compute for complex aggregations.
Security: Apply row-level security for region-specific access.

Sample input (subset of event_log for purchases):

event_date   region  event_value
2023-02-20   EU      99.99
2023-04-10   US      149.50
2023-06-15   Asia    75.25

Sample output:

year   region  total_value
2023   Asia    75.25
2023   EU      99.99
2023   US      149.50
2023   NULL    324.74
NULL   NULL    324.74

BI visualization: Shows total purchase values by region for 2023, with a separate bar for the yearly total.

Query flow:

6. Recursive CTE for Customer Purchase Paths

This query uses a recursive CTE to trace customer purchase sequences, enabling user journey analysis for personalized marketing.

Query:

SQL
WITH RECURSIVE purchase_path AS (
    SELECT e1.user_id, e1.event_id, e1.event_date, e1.event_value, 1 AS path_level
    FROM event_log e1
    WHERE e1.event_type = 'Purchase'
      AND e1.event_date = (
          SELECT MIN(e2.event_date)
          FROM event_log e2
          WHERE e2.user_id = e1.user_id
            AND e2.event_type = 'Purchase'
      )
    UNION ALL
    SELECT e.user_id, e.event_id, e.event_date, e.event_value, p.path_level + 1
    FROM event_log e
    JOIN purchase_path p
      ON e.user_id = p.user_id
     AND e.event_date > p.event_date
     AND e.event_type = 'Purchase'
)
SELECT u.user_id, u.first_name, u.last_name, p.event_date, p.event_value, p.path_level
FROM purchase_path p
JOIN user u ON p.user_id = u.user_id
ORDER BY u.user_id, p.path_level;

Explanation: The recursive CTE builds a sequence of purchases per user. The anchor member selects each user’s first purchase (the correlated subquery finds the earliest purchase date for that user), and the recursive member appends later purchases, tracking their order in path_level, which is useful for journey analysis.

Snowflake Enhancements

Recursive CTEs: Efficiently handle hierarchical data.
Semi-structured data: Extract purchase details from JSON fields with FLATTEN (see the sketch after this list).
Performance: Optimize with clustering on user_id and event_date.
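As a hedged sketch of the FLATTEN enhancement above: suppose each purchase’s event_data also carried an items array (an assumption; the sample data in this article stores only product_id and category). LATERAL FLATTEN would then expand each item into its own row so purchase paths can be analyzed per product.

SQL
-- Illustrative only: assumes event_data looks like
-- {"items": [{"sku": "A100", "qty": 2}, {"sku": "B200", "qty": 1}]}
SELECT
    e.user_id,
    e.event_date,
    f.value:sku::STRING AS sku,
    f.value:qty::INT    AS qty
FROM event_log e,
     LATERAL FLATTEN(input => e.event_data:items) f
WHERE e.event_type = 'Purchase';

Because the sample dataset has no items array, the remainder of this section sticks to the purchase-path query itself.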
Sample input

user table:

user_id  first_name  last_name
101      Alice       Smith
102      Bob         Johnson

event_log (purchases):

event_id  user_id  event_date   event_value  event_type
2         101      2023-02-20   99.99        Purchase
4         101      2023-04-10   149.50       Purchase
6         102      2023-06-15   75.25        Purchase

Sample output:

user_id  first_name  last_name  event_date   event_value  path_level
101      Alice       Smith      2023-02-20   99.99        1
101      Alice       Smith      2023-04-10   149.50       2
102      Bob         Johnson    2023-06-15   75.25        1

BI visualization: Shows purchase values by user and path level, illustrating customer purchase sequences.

Query flow:

7. Parsing JSON Events

This query extracts fields from the semi-structured JSON data in event_log.

Query:

SQL
SELECT
    e.event_date,
    e.event_data:product_id::INT AS product_id,
    e.event_data:category::STRING AS category
FROM event_log e
WHERE e.event_type = 'Purchase'
  AND e.event_data IS NOT NULL;

Explanation: The query uses Snowflake’s path notation (the colon operator) to parse JSON fields (product_id, category) from the event_data column, enabling detailed product analysis. This builds on the previous queries by adding semi-structured data capabilities.

Snowflake Enhancements

Native JSON support: Parse JSON without external tools.
Schema-on-read: Handle evolving JSON schemas dynamically.
Performance: Use VARIANT columns for efficient JSON storage.

Sample input (subset of event_log with JSON data):

event_id  event_date   event_type  event_data
2         2023-02-20   Purchase    {"product_id": 101, "category": "Electronics"}
4         2023-04-10   Purchase    {"product_id": 102, "category": "Clothing"}

Sample output:

event_date   product_id  category
2023-02-20   101         Electronics
2023-04-10   102         Clothing

BI visualization: Shows the distribution of purchases by product category, highlighting category popularity.

Query flow diagram:

System Architecture

The system architecture diagram describes Snowflake’s role in ShopSphere’s data ecosystem, integrating with external sources, ETL tools, and BI platforms. It is structured in four layers that reflect the data lifecycle, using distinct shapes for clarity:

External data sources: CRM systems and API feeds, shown as ellipses, provide raw customer and transaction data, forming the pipeline’s input.
Snowflake data platform: Snowflake’s cloud storage and virtual warehouses store and process data, serving as the core analytics engine.
ETL tools: Tools like dbt and Airflow transform and orchestrate data, representing the decision-driven processing steps.
BI tools: Tableau and Power BI visualize query results as dashboards and reports, representing the pipeline’s output.

Practical Considerations

The following considerations help make the queries robust in real-world scenarios, building on the technical foundation established above.

Performance Optimization

Clustering keys: Use clustering on high-cardinality columns (e.g., user_id, event_date) to improve query performance on large datasets.
Query acceleration: Enable Snowflake’s query acceleration service for complex queries on massive datasets.
Cost management: Monitor compute usage and scale down warehouses during low-demand periods to optimize costs.

Data Quality

Handling edge cases: Account for missing data (for instance, NULL values in event_value) or duplicates (e.g., multiple purchases recorded for the same day) by adding DISTINCT, QUALIFY, or filtering clauses (see the sketch after this list).
Data skew: High purchase volumes in Q4 may cause performance issues; partition tables or use APPROX_COUNT_DISTINCT for scalability.
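A minimal sketch of those data-quality guards, assuming a repeated event_id marks a double-loaded row (an assumption about ShopSphere’s ingestion, not something the sample data shows): QUALIFY with ROW_NUMBER keeps one row per event, and APPROX_COUNT_DISTINCT trades a little accuracy for speed when counting distinct buyers at Q4 scale.

SQL
-- Keep one row per event_id (assumes a repeated event_id indicates a duplicate load).
SELECT *
FROM event_log
QUALIFY ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY event_date) = 1;

-- Approximate distinct purchasers per quarter, tolerant of heavy Q4 volumes.
SELECT
    EXTRACT(QUARTER FROM event_date) AS quarter,
    APPROX_COUNT_DISTINCT(user_id) AS approx_buyers
FROM event_log
WHERE event_type = 'Purchase'
GROUP BY quarter;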
Security and Compliance

Row-level security: Implement row access policies to restrict access to sensitive data (for example, region-specific results).
Data masking: Apply dynamic data masking for compliance with GDPR or CCPA when sharing reports with external partners (a sketch follows this list).
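The sketch below shows one way these policies might look. The policy names and role checks (EU_ANALYST, REPORTING_ADMIN) are assumptions for illustration, and both features require Enterprise Edition.

SQL
-- Hypothetical row access policy: EU analysts see only EU rows, admins see everything.
CREATE ROW ACCESS POLICY region_policy AS (region STRING) RETURNS BOOLEAN ->
    CURRENT_ROLE() = 'REPORTING_ADMIN'
    OR (CURRENT_ROLE() = 'EU_ANALYST' AND region = 'EU');

ALTER TABLE event_log ADD ROW ACCESS POLICY region_policy ON (region);

-- Hypothetical masking policy: hide user_id from roles that don't need it.
CREATE MASKING POLICY mask_user_id AS (val INT) RETURNS INT ->
    CASE WHEN CURRENT_ROLE() IN ('REPORTING_ADMIN') THEN val ELSE NULL END;

ALTER TABLE event_log MODIFY COLUMN user_id SET MASKING POLICY mask_user_id;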
Conclusion

Snowflake’s advanced SQL capabilities, combined with its scalable architecture and features like time travel, semi-structured data support, and zero-copy cloning, make it a powerful platform for online retail analytics. The queries and diagrams in this ShopSphere scenario demonstrate how to surface insights for seasonal trends, customer segmentation, user journey mapping, and product analysis.

Business Impact

These queries enable ShopSphere to optimize operations and drive growth:

Query 1’s seasonal trends informed a 15% increase in Q4 inventory, boosting sales.
Query 6’s user journey analysis improved customer retention by 10% through targeted campaigns for repeat buyers.
Query 7’s JSON parsing enabled precise product category analysis, optimizing marketing spend.

Together, these insights empower data-driven decisions that enhance profit and customer satisfaction.

By Ram Ghadiyaram DZone Core CORE
